soton_corenlppy.lexico.lexicon_bootstrapping_lib module
Lexicon bootstrapping library
-
soton_corenlppy.lexico.lexicon_bootstrapping_lib.
bootstrap_lexicon
(seed_lexicon, topic_sets=None, tfidf_model=None, corpus_dictionary=None, threshold_score=-1.0, bootstrap_iterations=1, term_degree=1, stemmer=<PorterStemmer>, hypo_depth=3, hyper_depth=1, entail_depth=3, dict_lexicon_config=None)
Run the bootstrapping algorithm to expand a lexicon using WordNet. On each iteration, tokens can optionally be filtered using pre-computed topic sets or a TF-IDF model to improve lexicon precision. The seed lexicon can contain plain text tokens (without any period characters), which are kept but not expanded. This allows specialist vocabulary outside WordNet to be added to the lexicon.
- Parameters
seed_lexicon (set) – set of seed WordNet synset names OR plain text for the lexicon, e.g. set( [ 'red.s.01', 'ruby red' ] ). Synsets will be expanded, plain text will not.
topic_sets (list) – topic sets calculated using lexicon_bootstrapping_lib.generate_topic_set(). If None, no topic model filtering is applied.
tfidf_model (gensim.models.TfidfModel) – pre-computed TF-IDF model for filtering. If None, no TF-IDF filtering is applied.
corpus_dictionary (gensim.corpora.dictionary.Dictionary) – dictionary of corpus { term_index : term_token } to be used with the TF-IDF model
threshold_score (float) – minimum TF-IDF threshold score for a term to be added to a topic. A value < 0 means no threshold to be applied.
bootstrap_iterations (int) – number of bootstrap iterations. each iteration will use the expanded lexicon as a seed. iterations beyond 1 risk losing lexicon precision but increase recall.
term_degree (int) – use 1st or 2nd degree terms as calculated by lexicon_bootstrapping_lib.calc_topic_degree_lists()
stemmer (nltk.stem.api.StemmerI) – stemmer to use. A value of None means stemming is not applied to tokens.
hypo_depth (int) – how deep to follow WordNet inherited hyponyms
hyper_depth (int) – how deep to follow WordNet inherited hypernyms
entail_depth (int) – how deep to follow WordNet inherited entailments
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
expanded set of WordNet synset names OR plain text for the lexicon, e.g. set( [ 'ruby red', 'red.s.01', 'reddish.s.01' ] )
- Return type
set
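As a rough illustration of the seed handling described for bootstrap_lexicon, the sketch below expands entries that contain a period (treated as synset names) and keeps plain text entries untouched. It does not call soton_corenlppy or WordNet: TOY_RELATED, is_synset_name and toy_bootstrap are hypothetical names, the relation map is made-up data standing in for WordNet lookups, and no topic set or TF-IDF filtering is applied.

```python
# Made-up relation map standing in for WordNet hyponym/hypernym/entailment lookups.
TOY_RELATED = {
    "red.s.01": {"reddish.s.01"},
    "reddish.s.01": {"crimson.s.01"},
}


def is_synset_name(entry):
    # Plain text seeds are documented as containing no period characters,
    # so a period is used here to recognise a synset name.
    return "." in entry


def toy_bootstrap(seed_lexicon, bootstrap_iterations=1):
    """Expand synset names via TOY_RELATED; keep plain text entries as they are."""
    lexicon = set(seed_lexicon)
    for _ in range(bootstrap_iterations):
        expanded = set(lexicon)
        for entry in lexicon:
            if is_synset_name(entry):
                expanded |= TOY_RELATED.get(entry, set())
        # The expanded lexicon becomes the seed for the next iteration.
        lexicon = expanded
    return lexicon


one_pass = toy_bootstrap({"red.s.01", "ruby red"}, bootstrap_iterations=1)
two_pass = toy_bootstrap({"red.s.01", "ruby red"}, bootstrap_iterations=2)
print(sorted(one_pass))
print(sorted(two_pass))
```

Because each iteration reuses the expanded lexicon as its seed, the second pass reaches terms that are two steps away from the original seed. This is the mechanism behind the recall gain and precision risk noted for bootstrap_iterations.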
-
soton_corenlppy.lexico.lexicon_bootstrapping_lib.
build_lda_model
(bag_of_words, corpus_dictionary, num_topics=100, num_passes=-1, chunksize=4096, dict_lexicon_config=None)
Build an LDA topic model for a corpus. Default parameters are taken from the Hoffman 2010 paper. See https://radimrehurek.com/gensim/models/ldamodel.html
- Parameters
bag_of_words (list) – bag of words representation of corpus. [ [ (term_index,term_freq), … N_term_in_doc ] … N_doc ]
corpus_dictionary (gensim.corpora.dictionary.Dictionary) – dictionary of corpus { term_index : term_token }
num_topics (int) – number of LDA topics
num_passes (int) – number of LDA passes (-1 to calculate based on corpus document size)
chunksize (int) – size of chunks per pass
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
LDA model
- Return type
gensim.models.ldamodel.LdaModel
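The bag_of_words and corpus_dictionary inputs shared by the model builders can be pictured with a small standard-library sketch. In practice gensim's Dictionary and its doc2bow() method are the usual way to produce these structures. The variable names and the toy documents below are assumptions made for illustration only.

```python
from collections import Counter

# Two tiny tokenised documents (made-up data).
docs = [
    ["statue", "terracotta", "statue"],
    ["head", "crown"],
]

# term_token -> term_index, assigned in order of first appearance.
token_to_index = {}
for doc in docs:
    for token in doc:
        token_to_index.setdefault(token, len(token_to_index))

# { term_index : term_token }, the shape described for corpus_dictionary.
index_to_token = {index: token for token, index in token_to_index.items()}

# [ [ (term_index, term_freq), ... ], ... ], one inner list per document.
bag_of_words = [
    sorted(Counter(token_to_index[token] for token in doc).items())
    for doc in docs
]

print(index_to_token)
print(bag_of_words)
```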
-
soton_corenlppy.lexico.lexicon_bootstrapping_lib.
build_lsa_model
(bag_of_words, corpus_dictionary, num_topics=100, chunksize=40000, onepass=False, power_iters=3, extra_samples=400, dict_lexicon_config=None)
Build an LSA topic model for a corpus. Default parameters are taken from the Hoffman 2010 paper. See https://radimrehurek.com/gensim/models/lsimodel.html
- Parameters
bag_of_words (list) – bag of words representation of corpus. [ [ (term_index,term_freq), … N_term_in_doc ] … N_doc ]
corpus_dictionary (gensim.corpora.dictionary.Dictionary) – dictionary of corpus { term_index : term_token }
num_topics (int) – number of LSA topics
chunksize (int) – size of chunks per pass
onepass (bool) – True if one pass only
power_iters (int) – power iterations for multi-pass
extra_samples (int) – oversampling factor for multipass
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
LSA model
- Return type
gensim.models.lsimodel.LsiModel
-
soton_corenlppy.lexico.lexicon_bootstrapping_lib.
build_tfidf_model
(bag_of_words, corpus_dictionary, dict_lexicon_config=None)
Build a TF-IDF model for a corpus. See https://radimrehurek.com/gensim/models/tfidfmodel.html
- Parameters
bag_of_words (list) – bag of words representation of corpus. [ [ (term_index,term_freq), … N_term_in_doc ] … N_doc ]
corpus_dictionary (gensim.corpora.dictionary.Dictionary) – dictionary of corpus { term_index : term_token }
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
TF-IDF model
- Return type
gensim.models.TfidfModel
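To make the role of a TF-IDF score and a threshold_score concrete, here is a minimal standard-library sketch. It uses the textbook weighting tf * log2(N / df) without normalisation, which is an assumption and may differ from what the gensim model computes with its own settings. The names tfidf and keep_terms are hypothetical, and treating the threshold as an inclusive minimum is also an assumption.

```python
import math

# Bag-of-words corpus: one list of (term_index, term_freq) pairs per document.
bag_of_words = [
    [(0, 2), (1, 1)],
    [(0, 1), (2, 1)],
]
num_docs = len(bag_of_words)

# Document frequency: in how many documents each term index occurs.
doc_freq = {}
for doc in bag_of_words:
    for term_index, _ in doc:
        doc_freq[term_index] = doc_freq.get(term_index, 0) + 1


def tfidf(doc):
    # Textbook weighting (assumed): term frequency times log2(N / document frequency).
    return [(i, tf * math.log2(num_docs / doc_freq[i])) for i, tf in doc]


def keep_terms(weighted_doc, threshold_score=-1.0):
    # A negative threshold means no threshold is applied, as documented above.
    if threshold_score < 0:
        return [i for i, _ in weighted_doc]
    return [i for i, weight in weighted_doc if weight >= threshold_score]


weights = [tfidf(doc) for doc in bag_of_words]
print(weights)
print(keep_terms(weights[0], threshold_score=0.5))
```

Term 0 occurs in every document, so its weight is zero and it is dropped once a positive threshold is set. Terms that occur in only one document survive.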
-
soton_corenlppy.lexico.lexicon_bootstrapping_lib.
calc_topic_degree_lists
(set_lexicon, list_topic_sets, term_degree=1, stemmer=None, dict_lexicon_config=None)
For all topics containing seed words, get the list of topics the seed words appear in and a set of 1st degree topically relevant words. For those topically relevant words, get the topics they appear in and a set of 2nd degree topically relevant words. E.g. statue -> topic [ statue, terracotta ] -> statue, terracotta, head, figure, hand, broken toe -> topic [ head, crown ], [ figure, zeus ] -> head, crown, figure, zeus
- Parameters
set_lexicon (set) – set of WordNet lexicon synsets and lemma names
list_topic_sets (list) – list of topics, each of which is a list of terms
term_degree (int) – term degree to use, 1 for 1st degree terms or 2 for 2nd degree terms
stemmer (nltk.stem.api.StemmerI) – stemmer to use for result terms (or None)
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
1st and 2nd degree topic lists = (set_terms_1st_degree, set_terms_2nd_degree)
- Return type
tuple
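The idea of 1st and 2nd degree terms described for calc_topic_degree_lists can be sketched with plain sets. This is only an interpretation of that description, not the library's code: toy_topic_degrees is a hypothetical name, terms are matched as plain strings rather than as WordNet synsets or lemma names, no stemming is applied, and the 2nd degree set is assumed to include the 1st degree terms.

```python
def toy_topic_degrees(set_lexicon, list_topic_sets):
    """Return (1st degree terms, 2nd degree terms) for a lexicon and a list of topics."""
    # 1st degree: every term from any topic that shares a term with the lexicon.
    first_degree = set()
    for topic in list_topic_sets:
        if set_lexicon & set(topic):
            first_degree |= set(topic)

    # 2nd degree: every term from any topic that shares a term with the 1st degree set.
    second_degree = set()
    for topic in list_topic_sets:
        if first_degree & set(topic):
            second_degree |= set(topic)

    return first_degree, second_degree


topics = [
    ["statue", "terracotta", "head"],
    ["head", "crown"],
    ["figure", "zeus"],
    ["coin", "silver"],
]
first, second = toy_topic_degrees({"statue"}, topics)
print(sorted(first))
print(sorted(second))
```

Here "crown" is reached only through "head", so it appears in the 2nd degree set but not the 1st. Topics with no link to the lexicon contribute nothing.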
-
soton_corenlppy.lexico.lexicon_bootstrapping_lib.
generate_topic_set
(model, corpus_dictionary, threshold_score=-1.0, top_n=100, num_topics=100, dict_lexicon_config=None)
Construct a topic set of filtered terms using a pre-calculated gensim model (LDA or LSA).
- Parameters
model (gensim.models.ldamodel.LdaModel or gensim.models.lsimodel.LsiModel) – pre-computed topic model
corpus_dictionary (gensim.corpora.dictionary.Dictionary) – dictionary of corpus { term_index : term_token }
threshold_score (float) – minimum threshold score for a term to be added to a topic. A value < 0 means the current default threshold score is used for LDA or LSI.
top_n (int) – maximum number of terms per topic. There may be fewer than this number if terms fail the threshold check or a topic simply has too few terms available.
num_topics (int) – number of topics in pre-computed model
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
list of topics, each being a list of terms for that topic
- Return type
list
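The filtering that generate_topic_set describes (at most top_n terms per topic, each meeting a minimum score) can be sketched without gensim. The input below imitates the (term, score) pairs a topic model reports for each topic. toy_topic_set is a hypothetical name, the scores are made up, and the sketch requires an explicit threshold because the library's default threshold values for LDA and LSI are not stated here.

```python
def toy_topic_set(topic_term_scores, threshold_score, top_n=100):
    """Keep up to top_n highest-scoring terms per topic that meet the threshold."""
    topics = []
    for scored_terms in topic_term_scores:
        # Rank terms from highest to lowest score.
        ranked = sorted(scored_terms, key=lambda pair: pair[1], reverse=True)
        # Take the top_n candidates, then drop any below the minimum score.
        kept = [term for term, score in ranked[:top_n] if score >= threshold_score]
        topics.append(kept)
    return topics


# Made-up per-topic (term, score) pairs.
scores = [
    [("statue", 0.4), ("head", 0.1), ("terracotta", 0.3)],
    [("zeus", 0.05)],
]
topic_sets = toy_topic_set(scores, threshold_score=0.2, top_n=2)
print(topic_sets)
```

A topic can end up with fewer than top_n terms, or none at all, when its terms fall below the threshold, which matches the note on top_n.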