soton_corenlppy.lexico.lexicon_lib module
Lexicon bootstrapping library
soton_corenlppy.lexico.lexicon_lib.append_to_lexicon(dict_uri={}, dict_phrase={}, phrase_list=[], schema_uri=None, hyponymn_uri_list=[], lower_case=True, stemmer=None, apply_wordnet_morphy=False, dict_lexicon_config=None)
Append a new entry to an existing lexicon. This method adds information to dict_uri and dict_phrase.
- Parameters
dict_uri (dict) – URI dict structure
dict_phrase (dict) – phrase dict structure
phrase_list (list) – list of new phrases to add
schema_uri (unicode) – schema URI for this phrase list
hyponymn_uri_list (list) – list of hyponym phrase URIs for this phrase list
lower_case (bool) – if True, all lexicon tokens are converted to lower case; otherwise case is left intact
stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)
apply_wordnet_morphy (bool) – if True, apply wordnet.morphy() to try and find the base form of a phrase to use in the lexicon
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
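A minimal usage sketch, assuming the documented signature; the schema URI and phrases below are illustrative only:

import soton_corenlppy.lexico.lexicon_lib as lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()

# start with an empty lexicon; append_to_lexicon() populates both dicts in place
dict_uri = {}
dict_phrase = {}

lexicon_lib.append_to_lexicon(
    dict_uri = dict_uri,
    dict_phrase = dict_phrase,
    phrase_list = [ 'left shoulder', 'right shoulder' ],
    schema_uri = 'http://example.org/id/part',
    hyponymn_uri_list = [],
    lower_case = True,
    dict_lexicon_config = dict_config )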
soton_corenlppy.lexico.lexicon_lib.calc_corpus_dictionary(list_token_sets, stoplist=[], stemmer=None, dict_lexicon_config=None)
Calculate a corpus dictionary and bag-of-words representation for a corpus.
- Parameters
list_token_sets (list) – list of tokenized sentences in the corpus, each sentence represented as a list of tokens
stoplist (list) – list of stopwords to remove from sentence tokens
stemmer (nltk.stem.api.StemmerI) – stemmer to use on tokens (or None)
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
(gensim.corpora.dictionary.Dictionary(), list_bag_of_words)
- Return type
tuple
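A minimal sketch with a toy two-sentence corpus (the tokens and stoplist are illustrative):

import soton_corenlppy.lexico.lexicon_lib as lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()

# two tokenized sentences; stopwords are stripped before the dictionary is built
list_token_sets = [
    [ 'the', 'lotus', 'flower', 'blooms' ],
    [ 'a', 'lotus', 'grows', 'in', 'mud' ],
]
( gensim_dict, list_bag_of_words ) = lexicon_lib.calc_corpus_dictionary(
    list_token_sets = list_token_sets,
    stoplist = [ 'the', 'a', 'in' ],
    stemmer = None,
    dict_lexicon_config = dict_config )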
soton_corenlppy.lexico.lexicon_lib.export_lexicon(filename_lexicon=None, dict_uri=None, dict_phrase=None, dict_lexicon_config=None)
Export a lexicon to file.
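A sketch of serializing a lexicon so it can be reloaded with import_lexicon() below; the filename is illustrative:

import soton_corenlppy.lexico.lexicon_lib as lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()

# dict_uri and dict_phrase would normally come from one of the import functions
# or append_to_lexicon(); empty dicts keep this sketch self-contained
dict_uri = {}
dict_phrase = {}

lexicon_lib.export_lexicon(
    filename_lexicon = 'my_lexicon.dat',
    dict_uri = dict_uri,
    dict_phrase = dict_phrase,
    dict_lexicon_config = dict_config )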
soton_corenlppy.lexico.lexicon_lib.filter_lexicon_wordnet(dict_phrase=None, set_ignore_hyper={}, pos='asrnv', lang='eng', count_freq_min=5, dict_lexicon_config=None)
Filter a lexicon to remove all phrases that are WordNet terms with a common frequency. This allows specialist lexicons, with a very specific word sense, to delete phrases that can have many other word senses, avoiding false positives when using the lexicon. This method deletes entries from dict_phrase.
Note: if stemming has been applied to the lexicon there will be no matches for stemmed words (e.g. 'lotus' is stemmed to 'lotu', which does not match the WordNet entry 'lotus').
- Parameters
dict_phrase (dict) – phrase dict structure
set_ignore_hyper (set) – set of hypernyms (WordNet synset names) whose hyponyms should not be used for filtering (e.g. material.n.01)
pos (str) – WordNet POS filter
lang (str) – WordNet language
count_freq_min (int) – minimum WordNet lemma count frequency below which a WordNet lemma is not used as a filter. This is so only common words are filtered out; set to 0 to filter on any WordNet word
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
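A sketch reusing the material.n.01 example above; the lexicon file is assumed to have been written by export_lexicon():

import soton_corenlppy.lexico.lexicon_lib as lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()
( dict_uri, dict_phrase ) = lexicon_lib.import_lexicon(
    filename_lexicon = 'my_lexicon.dat', dict_lexicon_config = dict_config )

# delete common WordNet phrases, but keep anything under material.n.01
lexicon_lib.filter_lexicon_wordnet(
    dict_phrase = dict_phrase,
    set_ignore_hyper = { 'material.n.01' },
    count_freq_min = 5,
    dict_lexicon_config = dict_config )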
soton_corenlppy.lexico.lexicon_lib.get_lexicon_config(**kwargs)
Return a lexicon config object.
Note: a config object approach is used, as opposed to a global variable, to allow lexicon_lib functions to work in a multi-threaded environment.
- Returns
configuration settings to be used by all lexicon_lib functions
- Return type
dict
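A sketch: create the config once and pass it to every lexicon_lib call. Which keyword arguments are honoured is not documented in this section, so none are passed here:

import soton_corenlppy.lexico.lexicon_lib as lexicon_lib

# one config object per thread avoids shared global state
dict_config = lexicon_lib.get_lexicon_config()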
soton_corenlppy.lexico.lexicon_lib.import_NELL_lexicon(filename_nell=None, lower_case=False, stemmer=None, apply_wordnet_morphy=False, allowed_schema_list=None, dict_lexicon_config=None)
Import a lexicon from a tab-delimited NELL KBP CSV file.
- Parameters
filename_nell (str) – tab delimited file from NELL
lower_case (bool) – if True, all lexicon tokens are converted to lower case; otherwise case is left intact
stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)
apply_wordnet_morphy (bool) – if True, apply wordnet.morphy() to try and find the base form of a phrase to use in the lexicon
allowed_schema_list (list) – filter list of allowed schema values for imported phrases (default None which allows any schema URI)
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
( dict_uri, dict_phrase ), where dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] } and dict_phrase = { phrase : set_uri }. Several phrases can share a phrase URI
- Return type
tuple
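A sketch; the filename is illustrative, and case is left intact here, matching the default:

import soton_corenlppy.lexico.lexicon_lib as lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()

# allowed_schema_list is left at its default None, so any schema URI is accepted
( dict_uri, dict_phrase ) = lexicon_lib.import_NELL_lexicon(
    filename_nell = 'NELL.KBP.csv',
    lower_case = False,
    dict_lexicon_config = dict_config )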
soton_corenlppy.lexico.lexicon_lib.import_lexicon(filename_lexicon=None, dict_lexicon_config=None)
Import a lexicon from a file serialized using lexicon_lib.export_lexicon().
- Parameters
filename_lexicon (str) – lexicon file serialized by lexicon_lib.export_lexicon()
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
( dict_uri, dict_phrase ), where dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] } and dict_phrase = { phrase : set_uri }. Several phrases can share a phrase URI
- Return type
tuple
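A sketch, reloading the file written in the export_lexicon() example above:

import soton_corenlppy.lexico.lexicon_lib as lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()
( dict_uri, dict_phrase ) = lexicon_lib.import_lexicon(
    filename_lexicon = 'my_lexicon.dat',
    dict_lexicon_config = dict_config )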
soton_corenlppy.lexico.lexicon_lib.import_plain_lexicon(filename_lemma=None, list_column_names=['col1', 'schema', 'col3', 'col4', 'phrase_list'], phrase_delimiter='|', lower_case=True, stemmer=None, apply_wordnet_morphy=False, allowed_schema_list=None, dict_lexicon_config=None)
Import a lexicon from a tab-delimited CSV file. Lines beginning with # are ignored. Columns must include a 'schema' and a 'phrase_list' entry. Phrases are delimited by phrase_delimiter, e.g. 'phrase one | phrase two'. Optionally there can also be a 'hypernym' entry, which is a space-delimited list of phrase URIs. Phrase URIs are constructed from the schema and the phrase tokens passed through ASCII urllib.quote_plus(), e.g. schema 'http://example.org/id/part' + phrase 'left shoulder' == 'http://example.org/id/part#left+shoulder'.
- Parameters
filename_lemma (str) – tab delimited file with schema URI and phrases
list_column_names (list) – names of the columns. columns other than ‘schema’, ‘phrase_list’ and ‘hypernym’ are ignored
phrase_delimiter (str) – character delimiting phrases in phrase_list
lower_case (bool) – if True, all lexicon tokens are converted to lower case; otherwise case is left intact
stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)
apply_wordnet_morphy (bool) – if True, apply wordnet.morphy() to try and find the base form of a phrase to use in the lexicon
allowed_schema_list (list) – filter list of allowed schema values for imported phrases (default None which allows any schema URI)
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
( dict_uri, dict_phrase ), where dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] } and dict_phrase = { phrase : set_uri }. Several phrases can share a phrase URI
- Return type
tuple
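A sketch, assuming a tab-delimited file laid out to match the default list_column_names; the file content shown in the comment is illustrative:

import soton_corenlppy.lexico.lexicon_lib as lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()

# parts.tsv (tab-delimited), e.g.:
# id1	http://example.org/id/part	x	y	left shoulder|right shoulder
( dict_uri, dict_phrase ) = lexicon_lib.import_plain_lexicon(
    filename_lemma = 'parts.tsv',
    list_column_names = [ 'col1', 'schema', 'col3', 'col4', 'phrase_list' ],
    phrase_delimiter = '|',
    dict_lexicon_config = dict_config )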
soton_corenlppy.lexico.lexicon_lib.import_skos_lexicon(filename_lemma=None, filename_hypernym=None, filename_related=None, serialized_format='json', lower_case=True, stemmer=None, apply_wordnet_morphy=False, allowed_schema_list=None, dict_lexicon_config=None)
Import from file a serialized result of a SPARQL query over a SKOS vocabulary. Files can be serialized in JSON or CSV tab-delimited format. The result is an in-memory index for lexicon terms.
filename_lemma should report SPARQL query variables ?skos_concept ?scheme ?label
filename_hypernym should report SPARQL query variables ?skos_concept ?hypernym
filename_related should report SPARQL query variables ?skos_concept ?related
example query for filename_lemma
SELECT DISTINCT ?skos_concept ?scheme ?label WHERE { ?skos_concept rdf:type skos:Concept . OPTIONAL { ?skos_concept skos:inScheme ?scheme } { ?skos_concept skos:prefLabel ?label } UNION { ?skos_concept skos:altLabel ?label } } ORDER BY ?skos_concept ?scheme ?label
example query for filename_hypernym
SELECT DISTINCT ?skos_concept ?hypernym WHERE { ?skos_concept rdf:type skos:Concept . ?skos_concept skos:broader ?hypernym } ORDER BY ?skos_concept ?hypernym
example query for filename_related
SELECT DISTINCT ?skos_concept ?related WHERE { ?skos_concept rdf:type skos:Concept . ?skos_concept skos:related ?related } ORDER BY ?skos_concept ?related
- Parameters
filename_lemma (str) – file for lemma SPARQL query result
filename_hypernym (str) – file for hypernym SPARQL query result (can be None)
filename_related (str) – file for related SPARQL query result (can be None)
serialized_format (str) – format of files = json|csv
lower_case (bool) – if True, all lexicon tokens are converted to lower case; otherwise case is left intact
stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)
apply_wordnet_morphy (bool) – if True, apply wordnet.morphy() to try and find the base form of a phrase to use in the lexicon
allowed_schema_list (list) – filter list of allowed schema values for imported phrases (default None which allows any schema URI)
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
( dict_uri, dict_phrase ), where dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] } and dict_phrase = { phrase : set_uri }. Several phrases can share a phrase URI
- Return type
tuple
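A sketch, assuming the three SPARQL queries above were run and their results serialized as JSON; the filenames are illustrative:

import soton_corenlppy.lexico.lexicon_lib as lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()
( dict_uri, dict_phrase ) = lexicon_lib.import_skos_lexicon(
    filename_lemma = 'skos_lemma.json',
    filename_hypernym = 'skos_hypernym.json',
    filename_related = 'skos_related.json',
    serialized_format = 'json',
    dict_lexicon_config = dict_config )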
soton_corenlppy.lexico.lexicon_lib.load_plain_vocabulary(filename_vocab=None, phrase_delimiter=';', lower_case=True, stemmer=None, apply_wordnet_morphy=False, dict_lexicon_config=None)
Import a vocabulary from a plain-text file. Lines beginning with # are ignored. Phrases are delimited by phrase_delimiter, with whitespace at the start and end stripped, e.g. 'phrase one ; phrase two' -> ['phrase one', 'phrase two'].
- Parameters
filename_vocab (str) – file with vocabulary
phrase_delimiter (str) – character delimiting phrases in phrase_list
lower_case (bool) – if True, all lexicon tokens are converted to lower case; otherwise case is left intact
stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)
apply_wordnet_morphy (bool) – if True, apply wordnet.morphy() to try and find the base form of a phrase to use in the lexicon
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
list of phrases in vocabulary
- Return type
list
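A sketch; the filename is illustrative, and for a file containing the line 'phrase one ; phrase two' the call returns ['phrase one', 'phrase two']:

import soton_corenlppy.lexico.lexicon_lib as lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()
list_phrases = lexicon_lib.load_plain_vocabulary(
    filename_vocab = 'vocab.txt',
    phrase_delimiter = ';',
    dict_lexicon_config = dict_config )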
soton_corenlppy.lexico.lexicon_lib.load_sparql_query_results(filename_results=None, serialized_format='json', list_variable_names=[], dict_lexicon_config=None)
Load and parse a SPARQL query result file. Internal function called by lexicon_lib.import_skos_lexicon().
- Parameters
filename_results (str) – file with serialized SPARQL query results
serialized_format (str) – format of file = json|csv
list_variable_names (list) – names of the SPARQL query variables to read
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
dict of results indexed by the first variable's value. A list is kept, as each first variable value might occur more than once: { var1 : [ [var2,var3,…], [var2,var3,…], … ] }
- Return type
dict
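Although internal, it can be called directly. A sketch parsing the lemma file from the import_skos_lexicon() example, with variable names matching the lemma query above:

import soton_corenlppy.lexico.lexicon_lib as lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()

# result shape: { skos_concept : [ [scheme, label], ... ] }
dict_results = lexicon_lib.load_sparql_query_results(
    filename_results = 'skos_lemma.json',
    serialized_format = 'json',
    list_variable_names = [ 'skos_concept', 'scheme', 'label' ],
    dict_lexicon_config = dict_config )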
soton_corenlppy.lexico.lexicon_lib.merge_lexicon(list_lexicon=None, dict_lexicon_config=None)
Merge N lexicons, aggregating each phrase entry. The first schema URI found in the lexicon list (processed top to bottom) is assigned to each phrase URI.
- Parameters
list_lexicon (list) – list of lexicons to merge, processed top to bottom
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
( dict_uri, dict_phrase ), where dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] } and dict_phrase = { phrase : set_uri }. Several phrases can share a phrase URI
- Return type
tuple
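A sketch; the exact shape of the list_lexicon entries is not documented in this section, so a list of ( dict_uri, dict_phrase ) pairs, as returned by the import functions, is assumed here:

import soton_corenlppy.lexico.lexicon_lib as lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()
lexicon_skos = lexicon_lib.import_skos_lexicon(
    filename_lemma = 'skos_lemma.json', dict_lexicon_config = dict_config )
lexicon_nell = lexicon_lib.import_NELL_lexicon(
    filename_nell = 'NELL.KBP.csv', dict_lexicon_config = dict_config )

# earlier lexicons win when assigning a schema URI to a phrase URI
( dict_uri, dict_phrase ) = lexicon_lib.merge_lexicon(
    list_lexicon = [ lexicon_skos, lexicon_nell ],
    dict_lexicon_config = dict_config )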
soton_corenlppy.lexico.lexicon_lib.phrase_lookup(phrase_tokens=None, head_token=None, lex_phrase_index=None, lex_uri_index=None, max_gram=5, stemmer=None, apply_wordnet_morphy=False, hyphen_variant=False, dict_lexicon_config=None)
Perform an n-gram lookup of phrases, optionally based around a head token, returning the lexicon phrase matches with a confidence score. The confidence score is based on the percentage of tokens in an extracted phrase that match the lexicon phrase.
- Parameters
phrase_tokens (list) – tokenized phrase to lookup in lexicon
head_token (unicode) – head token in the phrase, which must be in any n-gram looked up. The default None allows all possible n-grams to be looked up.
lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()
lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()
max_gram (int) – maximum phrase gram size to check for matches in lexicon. larger gram sizes means more lexicon checks, which is slower.
stemmer (nltk.stem.api.StemmerI) – stemmer to use on the last phrase token (default is None)
apply_wordnet_morphy (bool) – if True, apply wordnet.morphy() to the last phrase token
hyphen_variant (bool) – if True, look up the phrase as-is and also a variant with hyphens replaced by space characters
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
lexicon matches to phrase = [ ( lexicon_uri, schema_uri, matched_phrase, match_gram_size, confidence_score ) ]
- Return type
list
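A sketch, looking up a tokenized phrase in a previously imported lexicon; the tokens and filename are illustrative:

import soton_corenlppy.lexico.lexicon_lib as lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()
( dict_uri, dict_phrase ) = lexicon_lib.import_lexicon(
    filename_lexicon = 'my_lexicon.dat', dict_lexicon_config = dict_config )

# every match = ( lexicon_uri, schema_uri, matched_phrase, match_gram_size, confidence_score )
list_matches = lexicon_lib.phrase_lookup(
    phrase_tokens = [ 'left', 'shoulder' ],
    head_token = 'shoulder',
    lex_phrase_index = dict_phrase,
    lex_uri_index = dict_uri,
    max_gram = 5,
    dict_lexicon_config = dict_config )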
soton_corenlppy.lexico.lexicon_lib.read_noun_type_ranked_list(filename=None, dict_openie_config={})
Read a noun-type list for use as an allowed_schema_list (see import_skos_lexicon() and other import functions).
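A sketch; the filename is illustrative, and note this function takes a dict_openie_config rather than a lexicon config (left at its default here):

import soton_corenlppy.lexico.lexicon_lib as lexicon_lib

# the result can be passed as allowed_schema_list to the import functions
list_allowed_schema = lexicon_lib.read_noun_type_ranked_list(
    filename = 'noun_types_ranked.txt' )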
soton_corenlppy.lexico.lexicon_lib.sent_set_lookup(sent_token_set=None, lex_phrase_index=None, lex_uri_index=None, lower_case=True, max_gram=5, stemmer=None, apply_wordnet_morphy=False, hyphen_variant=False, dict_lexicon_config=None)
Apply phrase_lookup() to all n-gram phrases in a set of sentences, returning True if any phrase matches the lexicon.
- Parameters
sent_token_set (list) – list of token sets from soton_corenlppy.common_parse_lib.unigram_tokenize_text_with_sent_breakdown()
lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()
lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()
max_gram (int) – maximum phrase gram size to check for matches in lexicon. larger gram sizes means more lexicon checks, which is slower.
stemmer (nltk.stem.api.StemmerI) – stemmer to use on the last phrase token (default is None)
apply_wordnet_morphy (bool) – if True, apply wordnet.morphy() to the last phrase token
hyphen_variant (bool) – if True, look up the phrase as-is and also a variant with hyphens replaced by space characters
dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
- Returns
True if at least one phrase in the sentence set appears in the lexicon
- Return type
bool
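A sketch; sent_token_set is hand-built here for brevity, where normally it would come from soton_corenlppy.common_parse_lib.unigram_tokenize_text_with_sent_breakdown():

import soton_corenlppy.lexico.lexicon_lib as lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()
( dict_uri, dict_phrase ) = lexicon_lib.import_lexicon(
    filename_lexicon = 'my_lexicon.dat', dict_lexicon_config = dict_config )

# True if any n-gram in any sentence matches the lexicon
bool_match = lexicon_lib.sent_set_lookup(
    sent_token_set = [ [ 'the', 'left', 'shoulder', 'hurts' ] ],
    lex_phrase_index = dict_phrase,
    lex_uri_index = dict_uri,
    dict_lexicon_config = dict_config )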