soton_corenlppy.lexico.lexicon_bootstrapping_lib module

Lexicon bootstrapping library

soton_corenlppy.lexico.lexicon_bootstrapping_lib.bootstrap_lexicon(seed_lexicon, topic_sets=None, tfidf_model=None, corpus_dictionary=None, threshold_score=-1.0, bootstrap_iterations=1, term_degree=1, stemmer=<PorterStemmer>, hypo_depth=3, hyper_depth=1, entail_depth=3, dict_lexicon_config=None)[source]

run a bootstrapping algorithm to expand a lexicon using WordNet. on each iteration, tokens can optionally be filtered using pre-computed topic sets or a TF-IDF model to improve lexicon precision. the seed lexicon can contain plain text tokens (without any period characters), which will be kept but not expanded; this allows specialist vocabulary outside WordNet to be added to the lexicon.

Parameters
  • seed_lexicon (set) – set of seed WordNet synset names OR plain text for lexicon e.g. set( [ ‘red.s.01’, ‘ruby red’ ] ). synsets will be expanded, plain text will not.

  • topic_sets (list) – topic set calculated using lexicon_bootstrap_lib.generate_topic_set(). if None no topic model filtering will be applied

  • tfidf_model (gensim.models.TfidfModel) – pre-computed topic model for filtering. if None no TF-IDF filtering will be applied

  • corpus_dictionary (gensim.corpora.dictionary.Dictionary) – dictionary of corpus { term_index : term_token } to be used with the TF-IDF model

  • threshold_score (float) – minimum TF-IDF threshold score for a term to be added to a topic. A value < 0 means no threshold is applied.

  • bootstrap_iterations (int) – number of bootstrap iterations. each iteration will use the expanded lexicon as a seed. iterations beyond 1 risk losing lexicon precision but increase recall.

  • term_degree (int) – use 1st or 2nd degree terms as calculated by lexicon_bootstrap_lib.calc_topic_degree_lists()

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use. A value of None means stemming is not applied to tokens.

  • hypo_depth (int) – how deep to follow WordNet inherited hyponyms

  • hyper_depth (int) – how deep to follow WordNet inherited hypernyms

  • entail_depth (int) – how deep to follow WordNet inherited entailments

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

expanded set of WordNet synset names OR plain text for lexicon e.g. set( [ ‘ruby red’, ‘red.s.01’, ‘reddish.s.01’ ] )

Return type

set
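
A minimal usage sketch (the seed values are illustrative; topic-set and TF-IDF filtering are disabled, and dict_lexicon_config is left as None for brevity, although in practice it should be the object returned by lexicon_lib.get_lexicon_config()):

    from nltk.stem import PorterStemmer
    from soton_corenlppy.lexico import lexicon_bootstrapping_lib

    # seed lexicon mixing a WordNet synset name and a plain text term (illustrative values)
    seed_lexicon = set( [ 'red.s.01', 'ruby red' ] )

    # expand the seed via WordNet hyponyms/hypernyms/entailments; no topic or TF-IDF filtering
    expanded_lexicon = lexicon_bootstrapping_lib.bootstrap_lexicon(
        seed_lexicon = seed_lexicon,
        topic_sets = None,
        tfidf_model = None,
        corpus_dictionary = None,
        threshold_score = -1.0,
        bootstrap_iterations = 1,
        stemmer = PorterStemmer(),
        hypo_depth = 3,
        hyper_depth = 1,
        entail_depth = 3,
        dict_lexicon_config = None )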

soton_corenlppy.lexico.lexicon_bootstrapping_lib.build_lda_model(bag_of_words, corpus_dictionary, num_topics=100, num_passes=-1, chunksize=4096, dict_lexicon_config=None)[source]

build an LDA topic model for a corpus | default parameters taken from Hoffman 2010 paper | see https://radimrehurek.com/gensim/models/ldamodel.html

Parameters
  • bag_of_words (list) – bag of words representation of corpus. [ [ (term_index,term_freq), … N_term_in_doc ] … N_doc ]

  • corpus_dictionary (gensim.corpora.dictionary.Dictionary) – dictionary of corpus { term_index : term_token }

  • num_topics (int) – number of LDA topics

  • num_passes (int) – number of LDA passes (-1 to calculate based on corpus document size)

  • chunksize (int) – size of chunks per pass

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

LDA model

Return type

gensim.models.ldamodel.LdaModel
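
An illustrative sketch of preparing the bag-of-words corpus and dictionary with gensim before calling build_lda_model (the corpus texts are made up; dict_lexicon_config is left as None for brevity):

    from gensim import corpora
    from soton_corenlppy.lexico import lexicon_bootstrapping_lib

    # tokenized corpus (illustrative)
    texts = [ [ 'statue', 'terracotta', 'head' ], [ 'figure', 'zeus', 'crown' ] ]

    # gensim dictionary and bag-of-words in the [ [ (term_index, term_freq), ... ], ... ] form expected above
    corpus_dictionary = corpora.Dictionary( texts )
    bag_of_words = [ corpus_dictionary.doc2bow( doc ) for doc in texts ]

    lda_model = lexicon_bootstrapping_lib.build_lda_model(
        bag_of_words = bag_of_words,
        corpus_dictionary = corpus_dictionary,
        num_topics = 100,
        num_passes = -1,
        chunksize = 4096,
        dict_lexicon_config = None )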

soton_corenlppy.lexico.lexicon_bootstrapping_lib.build_lsa_model(bag_of_words, corpus_dictionary, num_topics=100, chunksize=40000, onepass=False, power_iters=3, extra_samples=400, dict_lexicon_config=None)[source]

build an LSA (LSI) topic model for a corpus | see https://radimrehurek.com/gensim/models/lsimodel.html

Parameters
  • bag_of_words (list) – bag of words representation of corpus. [ [ (term_index,term_freq), … N_term_in_doc ] … N_doc ]

  • corpus_dictionary (gensim.corpora.dictionary.Dictionary) – dictionary of corpus { term_index : term_token }

  • num_topics (int) – number of LSA topics

  • chunksize (int) – size of chunks per pass

  • onepass (bool) – True if one pass only

  • power_iters (int) – power iterations for multi-pass

  • extra_samples (int) – oversampling factor for multipass

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

LSA (LSI) model

Return type

gensim.models.lsimodel.LsiModel
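
A comparable sketch for the LSA (LSI) variant, assuming bag_of_words and corpus_dictionary were prepared with gensim as in the build_lda_model example above:

    # build an LSI model over the same bag-of-words corpus (default parameter values shown)
    lsa_model = lexicon_bootstrapping_lib.build_lsa_model(
        bag_of_words = bag_of_words,
        corpus_dictionary = corpus_dictionary,
        num_topics = 100,
        chunksize = 40000,
        onepass = False,
        power_iters = 3,
        extra_samples = 400,
        dict_lexicon_config = None )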

soton_corenlppy.lexico.lexicon_bootstrapping_lib.build_tfidf_model(bag_of_words, corpus_dictionary, dict_lexicon_config=None)[source]

build a TF-IDF model for a corpus | see https://radimrehurek.com/gensim/models/tfidfmodel.html

Parameters
  • bag_of_words (list) – bag of words representation of corpus. [ [ (term_index,term_freq), … N_term_in_doc ] … N_doc ]

  • corpus_dictionary (gensim.corpora.dictionary.Dictionary) – dictionary of corpus { term_index : term_token }

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

TF-IDF model

Return type

gensim.models.TfidfModel
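
An illustrative call, again assuming bag_of_words and corpus_dictionary were prepared with gensim as in the build_lda_model example:

    # build a TF-IDF model over the bag-of-words corpus
    tfidf_model = lexicon_bootstrapping_lib.build_tfidf_model(
        bag_of_words = bag_of_words,
        corpus_dictionary = corpus_dictionary,
        dict_lexicon_config = None )

    # the resulting model can be passed to bootstrap_lexicon() as tfidf_model to enable TF-IDF filtering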

soton_corenlppy.lexico.lexicon_bootstrapping_lib.calc_topic_degree_lists(set_lexicon, list_topic_sets, term_degree=1, stemmer=None, dict_lexicon_config=None)[source]

for all seed words, get a list of topics they appear in and a set of 1st degree topically relevant words. for those topically relevant words, get the topics they appear in, and a set of 2nd degree topically relevant words | e.g. statue -> topic [ statue, terracotta ] -> statue, terracotta, head, figure, hand, broken toe -> topic [ head, crown ], [ figure, zeus ] -> head, crown, figure, zeus

Parameters
  • set_lexicon (set) – set of WordNet lexicon synsets and lemma names

  • list_topic_sets (list) – list of topics, each of which is a list of terms

  • term_degree (int) – use 1st or 2nd degree terms

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use for result terms (or None)

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

1st and 2nd degree topic lists = (set_terms_1st_degree,set_terms_2nd_degree)

Return type

tuple
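
An illustrative call with made-up lexicon and topic-set values; the result is the (set_terms_1st_degree, set_terms_2nd_degree) tuple described above:

    from nltk.stem import PorterStemmer
    from soton_corenlppy.lexico import lexicon_bootstrapping_lib

    # lexicon of WordNet synset and lemma names, plus topic sets (illustrative values)
    set_lexicon = set( [ 'statue.n.01', 'statue' ] )
    list_topic_sets = [ [ 'statue', 'terracotta' ], [ 'head', 'crown' ], [ 'figure', 'zeus' ] ]

    ( set_terms_1st_degree, set_terms_2nd_degree ) = lexicon_bootstrapping_lib.calc_topic_degree_lists(
        set_lexicon = set_lexicon,
        list_topic_sets = list_topic_sets,
        term_degree = 2,
        stemmer = PorterStemmer(),
        dict_lexicon_config = None )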

soton_corenlppy.lexico.lexicon_bootstrapping_lib.generate_topic_set(model, corpus_dictionary, threshold_score=-1.0, top_n=100, num_topics=100, dict_lexicon_config=None)[source]

construct a topic set of filtered terms using a pre-calculated gensim model (LDA or LSA)

Parameters
  • model (gensim.models.ldamodel.LdaModel or gensim.models.lsimodel.LsiModel) – pre-computed topic model

  • corpus_dictionary (gensim.corpora.dictionary.Dictionary) – dictionary of corpus { term_index : term_token }

  • threshold_score (float) – minimum threshold score for a term to be added to a topic. A value < 0 means the current default threshold score is used for LDA or LSI.

  • top_n (int) – maximum number of terms per topic. there may be fewer than this number if terms fail the threshold check or there are simply insufficient terms available for a topic.

  • num_topics (int) – number of topics in pre-computed model

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

list of topics, each being a list of terms for that topic

Return type

list
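
An illustrative end-to-end sketch: generate topic sets from a pre-computed LDA model (lda_model and corpus_dictionary as built in the build_lda_model example above) and use them to filter a bootstrapped lexicon (all values illustrative):

    # topic sets from the pre-computed LDA model and corpus dictionary
    topic_sets = lexicon_bootstrapping_lib.generate_topic_set(
        model = lda_model,
        corpus_dictionary = corpus_dictionary,
        threshold_score = -1.0,
        top_n = 100,
        num_topics = 100,
        dict_lexicon_config = None )

    # apply topic-set filtering during lexicon bootstrapping
    expanded_lexicon = lexicon_bootstrapping_lib.bootstrap_lexicon(
        seed_lexicon = set( [ 'statue.n.01' ] ),
        topic_sets = topic_sets,
        dict_lexicon_config = None )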