soton_corenlppy.lexico.lexicon_bootstrapping_lib module

Lexicon bootstrapping library

soton_corenlppy.lexico.lexicon_bootstrapping_lib.bootstrap_lexicon(seed_lexicon, topic_sets=None, tfidf_model=None, corpus_dictionary=None, threshold_score=-1.0, bootstrap_iterations=1, term_degree=1, stemmer=<PorterStemmer>, hypo_depth=3, hyper_depth=1, entail_depth=3, dict_lexicon_config=None)[source]

run a bootstrapping algorithm to expand a lexicon using WordNet. on each iteration, tokens can optionally be filtered using pre-computed topic sets or a TF-IDF model to improve lexicon precision. the seed lexicon can contain plain text tokens (without any period characters), which will be kept but not expanded; this allows specialist vocabulary outside WordNet to be added to the lexicon.

Parameters
  • seed_lexicon (set) – set of seed WordNet synset names OR plain text for lexicon e.g. set( [ ‘red.s.01’, ‘ruby red’ ] ). synsets will be expanded, plain text will not.

  • topic_sets (list) – topic set calculated using lexicon_bootstrap_lib.generate_topic_set(). if None no topic model filtering will be applied

  • tfidf_model (gensim.models.TfidfModel) – pre-computed topic model for filtering. if None no TF-IDF filtering will be applied

  • corpus_dictionary (gensim.corpora.dictionary.Dictionary) – dictionary of corpus { term_index : term_token } to be used with the TF-IDF model

  • threshold_score (float) – minimum TF-IDF threshold score for a term to be added to a topic. A value < 0 means no threshold is applied.

  • bootstrap_iterations (int) – number of bootstrap iterations. each iteration will use the expanded lexicon as a seed. iterations beyond 1 risk losing lexicon precision but increase recall.

  • term_degree (int) – use 1st or 2nd degree terms as calculated by lexicon_bootstrap_lib.calc_topic_degree_lists()

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use. A value of None means stemming is not applied to tokens.

  • hypo_depth (int) – how deep to follow WordNet inherited hyponyms

  • hyper_depth (int) – how deep to follow WordNet inherited hypernyms

  • entail_depth (int) – how deep to follow WordNet inherited entailments

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

expanded set of WordNet synset names OR plain text for lexicon e.g. set( [ ‘ruby red’, ‘red.s.01’, ‘reddish.s.01’ ] )

Return type

set
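
A minimal usage sketch (the seed values are illustrative; topic-set and TF-IDF filtering are disabled, and dict_lexicon_config is left as None for brevity, although in practice it should be the object returned by lexicon_lib.get_lexicon_config()):

    from nltk.stem import PorterStemmer
    from soton_corenlppy.lexico import lexicon_bootstrapping_lib

    # seed lexicon mixing a WordNet synset name and a plain text term (illustrative values)
    seed_lexicon = set( [ 'red.s.01', 'ruby red' ] )

    # expand the seed via WordNet hyponyms/hypernyms/entailments; no topic or TF-IDF filtering
    expanded_lexicon = lexicon_bootstrapping_lib.bootstrap_lexicon(
        seed_lexicon = seed_lexicon,
        topic_sets = None,
        tfidf_model = None,
        corpus_dictionary = None,
        threshold_score = -1.0,
        bootstrap_iterations = 1,
        stemmer = PorterStemmer(),
        hypo_depth = 3,
        hyper_depth = 1,
        entail_depth = 3,
        dict_lexicon_config = None )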

soton_corenlppy.lexico.lexicon_bootstrapping_lib.build_lda_model(bag_of_words, corpus_dictionary, num_topics=100, num_passes=-1, chunksize=4096, dict_lexicon_config=None)[source]

build an LDA topic model for a corpus | default parameters taken from Hoffman 2010 paper | see https://radimrehurek.com/gensim/models/ldamodel.html

Parameters
  • bag_of_words (list) – bag of words representation of corpus. [ [ (term_index,term_freq), … N_term_in_doc ] … N_doc ]

  • corpus_dictionary (gensim.corpora.dictionary.Dictionary) – dictionary of corpus { term_index : term_token }

  • num_topics (int) – number of LDA topics

  • num_passes (int) – number of LDA passes (-1 to calculate based on corpus document size)

  • chunksize (int) – size of chunks per pass

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

LDA model

Return type

gensim.models.ldamodel.LdaModel
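
An illustrative sketch of preparing the bag-of-words corpus and dictionary with gensim before calling build_lda_model (the corpus texts are made up; dict_lexicon_config is left as None for brevity):

    from gensim import corpora
    from soton_corenlppy.lexico import lexicon_bootstrapping_lib

    # tokenized corpus (illustrative)
    texts = [ [ 'statue', 'terracotta', 'head' ], [ 'figure', 'zeus', 'crown' ] ]

    # gensim dictionary and bag-of-words in the [ [ (term_index, term_freq), ... ], ... ] form expected above
    corpus_dictionary = corpora.Dictionary( texts )
    bag_of_words = [ corpus_dictionary.doc2bow( doc ) for doc in texts ]

    lda_model = lexicon_bootstrapping_lib.build_lda_model(
        bag_of_words = bag_of_words,
        corpus_dictionary = corpus_dictionary,
        num_topics = 100,
        num_passes = -1,
        chunksize = 4096,
        dict_lexicon_config = None )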

soton_corenlppy.lexico.lexicon_bootstrapping_lib.build_lsa_model(bag_of_words, corpus_dictionary, num_topics=100, chunksize=40000, onepass=False, power_iters=3, extra_samples=400, dict_lexicon_config=None)[source]

build an LSA (LSI) topic model for a corpus | see https://radimrehurek.com/gensim/models/lsimodel.html

Parameters
  • bag_of_words (list) – bag of words representation of corpus. [ [ (term_index,term_freq), … N_term_in_doc ] … N_doc ]

  • corpus_dictionary (gensim.corpora.dictionary.Dictionary) – dictionary of corpus { term_index : term_token }

  • num_topics (int) – number of LSA topics

  • chunksize (int) – size of chunks per pass

  • onepass (bool) – True if one pass only

  • power_iters (int) – power iterations for multi-pass

  • extra_samples (int) – oversampling factor for multipass

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

LSA (LSI) model

Return type

gensim.models.lsimodel.LsiModel
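
A comparable sketch for the LSA (LSI) variant, assuming bag_of_words and corpus_dictionary were prepared with gensim as in the build_lda_model example above:

    # build an LSI model over the same bag-of-words corpus (default parameter values shown)
    lsa_model = lexicon_bootstrapping_lib.build_lsa_model(
        bag_of_words = bag_of_words,
        corpus_dictionary = corpus_dictionary,
        num_topics = 100,
        chunksize = 40000,
        onepass = False,
        power_iters = 3,
        extra_samples = 400,
        dict_lexicon_config = None )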

soton_corenlppy.lexico.lexicon_bootstrapping_lib.build_tfidf_model(bag_of_words, corpus_dictionary, dict_lexicon_config=None)[source]

build a TF-IDF model for a corpus | see https://radimrehurek.com/gensim/models/tfidfmodel.html

Parameters
  • bag_of_words (list) – bag of words representation of corpus. [ [ (term_index,term_freq), … N_term_in_doc ] … N_doc ]

  • corpus_dictionary (gensim.corpora.dictionary.Dictionary) – dictionary of corpus { term_index : term_token }

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

TF-IDF model

Return type

gensim.models.TfidfModel
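
An illustrative call, again assuming bag_of_words and corpus_dictionary were prepared with gensim as in the build_lda_model example:

    # build a TF-IDF model over the bag-of-words corpus
    tfidf_model = lexicon_bootstrapping_lib.build_tfidf_model(
        bag_of_words = bag_of_words,
        corpus_dictionary = corpus_dictionary,
        dict_lexicon_config = None )

    # the resulting model can be passed to bootstrap_lexicon() as tfidf_model to enable TF-IDF filtering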

soton_corenlppy.lexico.lexicon_bootstrapping_lib.calc_topic_degree_lists(set_lexicon, list_topic_sets, term_degree=1, stemmer=None, dict_lexicon_config=None)[source]

for all seed words, get a list of topics they appear in and a set of 1st degree topically relevant words. for those topically relevant words, get the topics they appear in, and a set of 2nd degree topically relevant words | e.g. statue -> topic [ statue, terracotta ] -> statue, terracotta, head, figure, hand, broken toe -> topic [ head, crown ], [ figure, zeus ] -> head, crown, figure, zeus

Parameters
  • set_lexicon (set) – set of WordNet lexicon synsets and lemma names

  • list_topic_sets (list) – list of topics, each of which is a list of terms

  • term_degree (int) – use 1st or 2nd degree terms

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use for result terms (or None)

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

1st and 2nd degree topic lists = (set_terms_1st_degree,set_terms_2nd_degree)

Return type

tuple
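
An illustrative call with made-up lexicon and topic-set values; the result is the (set_terms_1st_degree, set_terms_2nd_degree) tuple described above:

    from nltk.stem import PorterStemmer
    from soton_corenlppy.lexico import lexicon_bootstrapping_lib

    # lexicon of WordNet synset and lemma names, plus topic sets (illustrative values)
    set_lexicon = set( [ 'statue.n.01', 'statue' ] )
    list_topic_sets = [ [ 'statue', 'terracotta' ], [ 'head', 'crown' ], [ 'figure', 'zeus' ] ]

    ( set_terms_1st_degree, set_terms_2nd_degree ) = lexicon_bootstrapping_lib.calc_topic_degree_lists(
        set_lexicon = set_lexicon,
        list_topic_sets = list_topic_sets,
        term_degree = 2,
        stemmer = PorterStemmer(),
        dict_lexicon_config = None )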

soton_corenlppy.lexico.lexicon_bootstrapping_lib.generate_topic_set(model, corpus_dictionary, threshold_score=-1.0, top_n=100, num_topics=100, dict_lexicon_config=None)[source]

construct a topic set of filtered terms using a pre-calculated gensim model (LDA or LSA)

Parameters
  • model (gensim.models.ldamodel.LdaModel or gensim.models.lsimodel.LsiModel) – pre-computed topic model

  • corpus_dictionary (gensim.corpora.dictionary.Dictionary) – dictionary of corpus { term_index : term_token }

  • threshold_score (float) – minimum threshold score for a term to be added to a topic. A value < 0 means the current default threshold score is used for LDA or LSI.

  • top_n (int) – maximum number of terms per topic. there may be fewer than this number if terms fail the threshold check or there are simply insufficient terms available for a topic.

  • num_topics (int) – number of topics in pre-computed model

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

list of topics, each being a list of terms for that topic

Return type

list
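
An illustrative end-to-end sketch: generate topic sets from a pre-computed LDA model (lda_model and corpus_dictionary as built in the build_lda_model example above) and use them to filter a bootstrapped lexicon (all values illustrative):

    # topic sets from the pre-computed LDA model and corpus dictionary
    topic_sets = lexicon_bootstrapping_lib.generate_topic_set(
        model = lda_model,
        corpus_dictionary = corpus_dictionary,
        threshold_score = -1.0,
        top_n = 100,
        num_topics = 100,
        dict_lexicon_config = None )

    # apply topic-set filtering during lexicon bootstrapping
    expanded_lexicon = lexicon_bootstrapping_lib.bootstrap_lexicon(
        seed_lexicon = set( [ 'statue.n.01' ] ),
        topic_sets = topic_sets,
        dict_lexicon_config = None )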