soton_corenlppy.lexico.lexicon_lib module

Lexicon bootstrapping library

soton_corenlppy.lexico.lexicon_lib.append_to_lexicon(dict_uri={}, dict_phrase={}, phrase_list=[], schema_uri=None, hyponymn_uri_list=[], lower_case=True, stemmer=None, apply_wordnet_morphy=False, dict_lexicon_config=None)[source]

append a new entry to an existing lexicon. this method will add information to dict_uri and dict_phrase.

Parameters
  • dict_uri (dict) – URI dict structure

  • dict_phrase (dict) – phrase dict structure

  • phrase_list (list) – list of new phrases to add

  • schema_uri (unicode) – schema URI for this phrase list

  • hyponymn_uri_list (list) – list of hyponym phrase URIs for this phrase list

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)

  • apply_wordnet_morphy (bool) – if True apply wordnet.morphy() to try and find the base form of a phrase to use in the lexicon

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
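
example usage (a minimal sketch, not part of the library documentation; the schema URI is hypothetical and get_lexicon_config() is assumed to accept no mandatory arguments)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()

# start from an empty lexicon and add two phrases under one schema URI
dict_uri = {}
dict_phrase = {}
lexicon_lib.append_to_lexicon(
    dict_uri=dict_uri,
    dict_phrase=dict_phrase,
    phrase_list=['left shoulder', 'right shoulder'],
    schema_uri='http://example.org/id/part',
    lower_case=True,
    dict_lexicon_config=dict_config)
# dict_uri and dict_phrase are updated in place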

soton_corenlppy.lexico.lexicon_lib.calc_corpus_dictionary(list_token_sets, stoplist=[], stemmer=None, dict_lexicon_config=None)[source]

calculate a corpus dictionary and bag-of-words representation for a corpus

Parameters
  • list_token_sets (list) – list of tokenized sentences in the corpus, each sentence represented as a list of tokens

  • stoplist (list) – list of stopwords to remove from sentence tokens

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on tokens (or None)

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

(gensim.corpora.dictionary.Dictionary(), list_bag_of_words)

Return type

tuple
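
example usage (a minimal sketch; assumes gensim is installed and get_lexicon_config() needs no mandatory arguments)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()

# two tokenized sentences; 'the' is removed via the stoplist
list_token_sets = [
    ['the', 'lotus', 'flower', 'blooms'],
    ['the', 'flower', 'wilts'],
]
gensim_dict, list_bag_of_words = lexicon_lib.calc_corpus_dictionary(
    list_token_sets,
    stoplist=['the'],
    dict_lexicon_config=dict_config)
# gensim_dict is a gensim.corpora.dictionary.Dictionary mapping tokens to integer ids;
# list_bag_of_words holds the bag-of-words representation of each sentence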

soton_corenlppy.lexico.lexicon_lib.export_lexicon(filename_lexicon=None, dict_uri=None, dict_phrase=None, dict_lexicon_config=None)[source]

export lexicon to file

Parameters
  • filename_lexicon (str) – file to serialize lexicon to

  • dict_uri (dict) – URI dict structure

  • dict_phrase (dict) – phrase dict structure

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
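
example usage (a minimal sketch; the filename is hypothetical and the lexicon structures follow the shapes documented for import_lexicon() below)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

# a tiny lexicon: dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] }
dict_uri = {'http://example.org/id/part#left+shoulder':
            ['http://example.org/id/part', set(), set()]}
dict_phrase = {'left shoulder': {'http://example.org/id/part#left+shoulder'}}

lexicon_lib.export_lexicon(
    filename_lexicon='my_lexicon.json',
    dict_uri=dict_uri,
    dict_phrase=dict_phrase,
    dict_lexicon_config=dict_config)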

soton_corenlppy.lexico.lexicon_lib.filter_lexicon_wordnet(dict_phrase=None, set_ignore_hyper={}, pos='asrnv', lang='eng', count_freq_min=5, dict_lexicon_config=None)[source]

filter a lexicon to remove all phrases that are WordNet terms with a common frequency. this allows specialist lexicons, with a very specific word sense, to delete phrases that can have many other word senses and thus avoid false positives when using the lexicon. this method will delete entries from dict_phrase.

note: if stemming is applied to the lexicon then there will be no matches for stemmed words (e.g. ‘lotus’ > ‘lotu’, which will not match the WordNet entry ‘lotus’).

Parameters
  • dict_phrase (dict) – phrase dict structure

  • set_ignore_hyper (set) – set of hypernyms (WordNet synset names) whose hyponyms should not be used for filtering (e.g. material.n.01)

  • pos (str) – WordNet POS filter

  • lang (str) – WordNet language

  • count_freq_min (int) – minimum WordNet lemma count frequency below which a WordNet lemma is not used as a filter. this is so only common words are filtered out. set to 0 to filter on any WordNet word

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
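
example usage (a minimal sketch; the phrase index is a hand-built stand-in and WordNet data must be available to NLTK)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

dict_phrase = {
    'lotus': {'http://example.org/id/plant#lotus'},
    'rebar': {'http://example.org/id/material#rebar'},
}
lexicon_lib.filter_lexicon_wordnet(
    dict_phrase=dict_phrase,
    set_ignore_hyper={'material.n.01'},  # keep anything under this synset
    pos='n',
    count_freq_min=5,
    dict_lexicon_config=dict_config)
# dict_phrase is filtered in place; common WordNet nouns may be deleted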

soton_corenlppy.lexico.lexicon_lib.get_lexicon_config(**kwargs)[source]

return a lexicon config object.

note: a config object approach is used, as opposed to a global variable, to allow lexicon_lib functions to work in a multi-threaded environment

Returns

configuration settings to be used by all lexicon_lib functions

Return type

dict
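
example usage (a minimal sketch of the multi-threaded pattern the note describes; the lexicon filenames are hypothetical)

import threading
from soton_corenlppy.lexico import lexicon_lib

def worker(filename_lexicon):
    # each thread builds its own config object instead of sharing a global
    dict_config = lexicon_lib.get_lexicon_config()
    lexicon_lib.import_lexicon(
        filename_lexicon=filename_lexicon,
        dict_lexicon_config=dict_config)

threads = [threading.Thread(target=worker, args=('lexicon-%d.json' % n,))
           for n in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()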

soton_corenlppy.lexico.lexicon_lib.import_NELL_lexicon(filename_nell=None, lower_case=False, stemmer=None, apply_wordnet_morphy=False, allowed_schema_list=None, dict_lexicon_config=None)[source]

import a lexicon from a tab-delimited NELL KBP CSV file.

Parameters
  • filename_nell (str) – tab delimited file from NELL

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)

  • apply_wordnet_morphy (bool) – if True apply wordnet.morphy() to try and find the base form of a phrase to use in the lexicon

  • allowed_schema_list (list) – filter list of allowed schema values for imported phrases (default None which allows any schema URI)

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

( dict_uri, dict_phrase ), where dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] } and dict_phrase = { phrase : set_uri }. several phrases can share a phrase_uri

Return type

tuple
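
example usage (a minimal sketch; the filename and the NELL category used as a schema filter are hypothetical)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

dict_uri, dict_phrase = lexicon_lib.import_NELL_lexicon(
    filename_nell='NELL.KBP.tsv',
    lower_case=True,
    allowed_schema_list=['concept:food'],  # import only phrases with this schema
    dict_lexicon_config=dict_config)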

soton_corenlppy.lexico.lexicon_lib.import_lexicon(filename_lexicon=None, dict_lexicon_config=None)[source]

import a lexicon from a file serialized using lexicon_lib.export_lexicon()

Parameters
  • filename_lexicon (str) – file to deserialize lexicon from

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

( dict_uri, dict_phrase ), where dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] } and dict_phrase = { phrase : set_uri }. several phrases can share a phrase_uri

Return type

tuple
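
example usage (a minimal sketch; reads back the hypothetical file written in the export_lexicon() example above)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

dict_uri, dict_phrase = lexicon_lib.import_lexicon(
    filename_lexicon='my_lexicon.json',
    dict_lexicon_config=dict_config)
# dict_phrase maps each phrase to its set of URIs, e.g.
# { 'left shoulder' : { 'http://example.org/id/part#left+shoulder' } }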

soton_corenlppy.lexico.lexicon_lib.import_plain_lexicon(filename_lemma=None, list_column_names=['col1', 'schema', 'col3', 'col4', 'phrase_list'], phrase_delimiter='|', lower_case=True, stemmer=None, apply_wordnet_morphy=False, allowed_schema_list=None, dict_lexicon_config=None)[source]

import a lexicon from a tab-delimited CSV file. lines beginning with # are ignored. columns must have a ‘schema’ and a ‘phrase_list’ entry. phrases are delimited by the phrase_delimiter e.g. ‘phrase one | phrase two’. optionally there can also be a ‘hypernym’ entry, which is a space-delimited list of phrase URIs. phrase URIs are constructed from the schema and phrase tokens applied to ASCII urllib.quote_plus() e.g. schema ‘http://example.org/id/part’ + phrase ‘left shoulder’ == ‘http://example.org/id/part#left+shoulder’

Parameters
  • filename_lemma (str) – tab delimited file with schema URI and phrases

  • list_column_names (list) – names of the columns. columns other than ‘schema’, ‘phrase_list’ and ‘hypernym’ are ignored

  • phrase_delimiter (str) – character delimiting phrases in phrase_list

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)

  • apply_wordnet_morphy (bool) – if True apply wordnet.morphy() to try and find the base form of a phrase to use in the lexicon

  • allowed_schema_list (list) – filter list of allowed schema values for imported phrases (default None which allows any schema URI)

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

( dict_uri, dict_phrase ), where dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] } and dict_phrase = { phrase : set_uri }. several phrases can share a phrase_uri

Return type

tuple
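
example usage (a minimal sketch; the file is written inline so the column layout matches the default list_column_names)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

with open('parts.tsv', 'w') as f:
    f.write('# col1\tschema\tcol3\tcol4\tphrase_list\n')
    f.write('1\thttp://example.org/id/part\t-\t-\tleft shoulder | right shoulder\n')

dict_uri, dict_phrase = lexicon_lib.import_plain_lexicon(
    filename_lemma='parts.tsv',
    list_column_names=['col1', 'schema', 'col3', 'col4', 'phrase_list'],
    phrase_delimiter='|',
    dict_lexicon_config=dict_config)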

soton_corenlppy.lexico.lexicon_lib.import_skos_lexicon(filename_lemma=None, filename_hypernym=None, filename_related=None, serialized_format='json', lower_case=True, stemmer=None, apply_wordnet_morphy=False, allowed_schema_list=None, dict_lexicon_config=None)[source]

import a lexicon from files containing serialized results of SPARQL queries over a SKOS vocabulary. files can be serialized in JSON or tab-delimited CSV format. the result is an in-memory index of lexicon terms.

filename_lemma should report SPARQL query variables ?skos_concept ?scheme ?label
filename_hypernym should report SPARQL query variables ?skos_concept ?hypernym
filename_related should report SPARQL query variables ?skos_concept ?related

example query for filename_lemma

SELECT DISTINCT ?skos_concept ?scheme ?label
WHERE {
        ?skos_concept rdf:type skos:Concept .
        OPTIONAL {
                ?skos_concept skos:inScheme ?scheme
        }
        { ?skos_concept skos:prefLabel ?label } UNION { ?skos_concept skos:altLabel ?label }
}
ORDER BY ?skos_concept ?scheme ?label

example query for filename_hypernym

SELECT DISTINCT ?skos_concept ?hypernym
WHERE {
        ?skos_concept rdf:type skos:Concept .
        ?skos_concept skos:broader ?hypernym
}
ORDER BY ?skos_concept ?hypernym

example query for filename_related

SELECT DISTINCT ?skos_concept ?related
WHERE {
        ?skos_concept rdf:type skos:Concept .
        ?skos_concept skos:related ?related
}
ORDER BY ?skos_concept ?related

Parameters
  • filename_lemma (str) – file for lemma SPARQL query result

  • filename_hypernym (str) – file for hypernym SPARQL query result (can be None)

  • filename_related (str) – file for related SPARQL query result (can be None)

  • serialized_format (str) – format of files = json|csv

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)

  • apply_wordnet_morphy (bool) – if True apply wordnet.morphy() to try and find the base form of a phrase to use in the lexicon

  • allowed_schema_list (list) – filter list of allowed schema values for imported phrases (default None which allows any schema URI)

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

( dict_uri, dict_phrase ), where dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] } and dict_phrase = { phrase : set_uri }. several phrases can share a phrase_uri

Return type

tuple
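
example usage (a minimal sketch; the filenames are hypothetical JSON serializations of the SPARQL query results above)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

dict_uri, dict_phrase = lexicon_lib.import_skos_lexicon(
    filename_lemma='skos_lemma.json',
    filename_hypernym='skos_hypernym.json',
    filename_related=None,  # related terms are optional
    serialized_format='json',
    dict_lexicon_config=dict_config)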

soton_corenlppy.lexico.lexicon_lib.load_plain_vocabulary(filename_vocab=None, phrase_delimiter=';', lower_case=True, stemmer=None, apply_wordnet_morphy=False, dict_lexicon_config=None)[source]

import a vocabulary from a plain text file. lines beginning with # are ignored. phrases are delimited by the phrase_delimiter and whitespace at start and end is stripped e.g. ‘phrase one ; phrase two’ -> [‘phrase one’, ‘phrase two’]

Parameters
  • filename_vocab (str) – file with vocabulary

  • phrase_delimiter (str) – character delimiting phrases in phrase_list

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)

  • apply_wordnet_morphy (bool) – if True apply wordnet.morphy() to try and find the base form of a phrase to use in the lexicon

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

list of phrases in vocabulary

Return type

list
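
example usage (a minimal sketch; the vocabulary file is written inline)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

with open('vocab.txt', 'w') as f:
    f.write('# plant vocabulary\n')
    f.write('lotus flower ; water lily\n')

list_phrases = lexicon_lib.load_plain_vocabulary(
    filename_vocab='vocab.txt',
    phrase_delimiter=';',
    dict_lexicon_config=dict_config)
# expected result with the defaults: ['lotus flower', 'water lily']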

soton_corenlppy.lexico.lexicon_lib.load_sparql_query_results(filename_results=None, serialized_format='json', list_variable_names=[], dict_lexicon_config=None)[source]

load and parse a SPARQL query result file. internal function called by lexicon_lib.import_skos_lexicon()

Parameters
  • filename_results (str) – filename to load

  • serialized_format (str) – format of files = json|csv

  • list_variable_names (list) – SPARQL variable names to expect

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

dict of results indexed by the first variable value. a list is kept, as each first variable value might occur more than once. { var1 : [ [var2,var3,…], [var2,var3,…], … ] }

Return type

dict
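
example usage (a minimal sketch; the filename is hypothetical and the variable names match the lemma query shown for import_skos_lexicon())

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

dict_results = lexicon_lib.load_sparql_query_results(
    filename_results='skos_lemma.json',
    serialized_format='json',
    list_variable_names=['skos_concept', 'scheme', 'label'],
    dict_lexicon_config=dict_config)
# shape: { skos_concept : [ [scheme, label], [scheme, label], ... ] }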

soton_corenlppy.lexico.lexicon_lib.merge_lexicon(list_lexicon=None, dict_lexicon_config=None)[source]

merge N lexicons, aggregating each phrase entry. the first schema URI found in the lexicon list (processed top to bottom) is assigned to each phrase URI

Parameters
  • list_lexicon (list) – list of lexicon tuples = [ ( dict_uri, dict_phrase ), … ]

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

( dict_uri, dict_phrase ), where dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] } and dict_phrase = { phrase : set_uri }. several phrases can share a phrase_uri

Return type

tuple
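
example usage (a minimal sketch; the two input lexicons are hand-built stand-ins for imported lexicons)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

lexicon_a = (
    {'http://example.org/id/part#arm': ['http://example.org/id/part', set(), set()]},
    {'arm': {'http://example.org/id/part#arm'}},
)
lexicon_b = (
    {'http://example.org/id/tool#hammer': ['http://example.org/id/tool', set(), set()]},
    {'hammer': {'http://example.org/id/tool#hammer'}},
)
dict_uri, dict_phrase = lexicon_lib.merge_lexicon(
    list_lexicon=[lexicon_a, lexicon_b],
    dict_lexicon_config=dict_config)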

soton_corenlppy.lexico.lexicon_lib.phrase_lookup(phrase_tokens=None, head_token=None, lex_phrase_index=None, lex_uri_index=None, max_gram=5, stemmer=None, apply_wordnet_morphy=False, hyphen_variant=False, dict_lexicon_config=None)[source]

perform an n-gram lookup of phrases, optionally based around a head token. return the lexicon phrase matches with a confidence score. the confidence score is based on the percentage of tokens in an extracted phrase that match the lexicon phrase.

Parameters
  • phrase_tokens (list) – tokenized phrase to lookup in lexicon

  • head_token (unicode) – head token in phrase which must be in any n-gram lookup. the default None allows all possible n-grams to be looked up.

  • lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • max_gram (int) – maximum phrase gram size to check for matches in lexicon. larger gram sizes mean more lexicon checks, which is slower.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on last phrase token (default is None)

  • apply_wordnet_morphy (bool) – if True apply wordnet.morphy() to the last phrase token

  • hyphen_variant (bool) – if True look up the phrase as it is, and also a version with hyphens replaced by space characters.

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

lexicon matches to phrase = [ ( lexicon_uri, schema_uri, matched_phrase, match_gram_size, confidence_score ) ]

Return type

list
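
example usage (a minimal sketch; the lexicon file is hypothetical and import_lexicon() is assumed to return ( dict_uri, dict_phrase ) in that order)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs
lex_uri_index, lex_phrase_index = lexicon_lib.import_lexicon(
    filename_lexicon='my_lexicon.json',
    dict_lexicon_config=dict_config)

list_matches = lexicon_lib.phrase_lookup(
    phrase_tokens=['sharp', 'pain', 'in', 'left', 'shoulder'],
    head_token='shoulder',  # only n-grams containing this token are checked
    lex_phrase_index=lex_phrase_index,
    lex_uri_index=lex_uri_index,
    max_gram=5,
    dict_lexicon_config=dict_config)
# each match = ( lexicon_uri, schema_uri, matched_phrase, match_gram_size, confidence_score )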

soton_corenlppy.lexico.lexicon_lib.read_noun_type_ranked_list(filename=None, dict_openie_config={})[source]

read a noun type list for use as an allowed_schema_list (see import_skos_lexicon() and other import functions)

Parameters
  • filename (unicode) – filename for ranked list

  • dict_openie_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

list of noun types

Return type

list
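
example usage (a minimal sketch; the filenames are hypothetical and passing the lexicon config as dict_openie_config is an assumption)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

list_noun_types = lexicon_lib.read_noun_type_ranked_list(
    filename='noun_types.txt',
    dict_openie_config=dict_config)

# use the ranked list to restrict which schema URIs are imported
dict_uri, dict_phrase = lexicon_lib.import_skos_lexicon(
    filename_lemma='skos_lemma.json',
    allowed_schema_list=list_noun_types,
    dict_lexicon_config=dict_config)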

soton_corenlppy.lexico.lexicon_lib.sent_set_lookup(sent_token_set=None, lex_phrase_index=None, lex_uri_index=None, lower_case=True, max_gram=5, stemmer=None, apply_wordnet_morphy=False, hyphen_variant=False, dict_lexicon_config=None)[source]

apply phrase_lookup() to all n-gram phrases in a set of sentences, returning True if any phrase matches the lexicon

Parameters
  • sent_token_set (list) – list of token sets from soton_corenlppy.common_parse_lib.unigram_tokenize_text_with_sent_breakdown()

  • lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lower_case (bool) – if True all sentence tokens will be converted to lower case before lookup. otherwise case is left intact.

  • max_gram (int) – maximum phrase gram size to check for matches in lexicon. larger gram sizes mean more lexicon checks, which is slower.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on last phrase token (default is None)

  • apply_wordnet_morphy (bool) – if True apply wordnet.morphy() to the last phrase token

  • hyphen_variant (bool) – if True look up the phrase as it is, and also a version with hyphens replaced by space characters.

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

True if at least one phrase in the sentence set appears in the lexicon

Return type

bool
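
example usage (a minimal sketch; the lexicon file is hypothetical and import_lexicon() is assumed to return ( dict_uri, dict_phrase ) in that order)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs
lex_uri_index, lex_phrase_index = lexicon_lib.import_lexicon(
    filename_lexicon='my_lexicon.json',
    dict_lexicon_config=dict_config)

sent_token_set = [
    ['the', 'left', 'shoulder', 'ached'],
    ['nothing', 'relevant', 'here'],
]
has_match = lexicon_lib.sent_set_lookup(
    sent_token_set=sent_token_set,
    lex_phrase_index=lex_phrase_index,
    lex_uri_index=lex_uri_index,
    max_gram=5,
    dict_lexicon_config=dict_config)
# True if at least one n-gram phrase in any sentence matches the lexicon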