soton_corenlppy.lexico.lexicon_lib module

Lexicon bootstrapping library

soton_corenlppy.lexico.lexicon_lib.append_to_lexicon(dict_uri={}, dict_phrase={}, phrase_list=[], schema_uri=None, hyponymn_uri_list=[], lower_case=True, stemmer=None, apply_wordnet_morphy=False, dict_lexicon_config=None)[source]

append a new entry to an existing lexicon. this method will add information to dict_uri and dict_phrase.

Parameters
  • dict_uri (dict) – URI dict structure

  • dict_phrase (dict) – phrase dict structure

  • phrase_list (list) – list of new phrases to add

  • schema_uri (unicode) – schema URI for this phrase list

  • hyponymn_uri_list (list) – list of hyponym phrase URIs for this phrase list

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)

  • apply_wordnet_morphy (bool) – if True apply wordnet.morphy() to try and find the base form of a phrase to use in the lexicon

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
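
example usage (a minimal sketch, not part of the library documentation; the schema URI is hypothetical and get_lexicon_config() is assumed to accept no mandatory arguments)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()

# start from an empty lexicon and add two phrases under one schema URI
dict_uri = {}
dict_phrase = {}
lexicon_lib.append_to_lexicon(
    dict_uri=dict_uri,
    dict_phrase=dict_phrase,
    phrase_list=['left shoulder', 'right shoulder'],
    schema_uri='http://example.org/id/part',
    lower_case=True,
    dict_lexicon_config=dict_config)
# dict_uri and dict_phrase are updated in place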

soton_corenlppy.lexico.lexicon_lib.calc_corpus_dictionary(list_token_sets, stoplist=[], stemmer=None, dict_lexicon_config=None)[source]

calculate a corpus dictionary and bag-of-words representation for a corpus

Parameters
  • list_token_sets (list) – list of tokenized sentences in the corpus, each sentence represented as a list of tokens

  • stoplist (list) – list of stopwords to remove from sentence tokens

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on tokens (or None)

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

(gensim.corpora.dictionary.Dictionary(), list_bag_of_words)

Return type

tuple
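
example usage (a minimal sketch; assumes gensim is installed and get_lexicon_config() needs no mandatory arguments)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()

# two tokenized sentences; 'the' is removed via the stoplist
list_token_sets = [
    ['the', 'lotus', 'flower', 'blooms'],
    ['the', 'flower', 'wilts'],
]
gensim_dict, list_bag_of_words = lexicon_lib.calc_corpus_dictionary(
    list_token_sets,
    stoplist=['the'],
    dict_lexicon_config=dict_config)
# gensim_dict is a gensim.corpora.dictionary.Dictionary mapping tokens to integer ids;
# list_bag_of_words holds the bag-of-words representation of each sentence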

soton_corenlppy.lexico.lexicon_lib.export_lexicon(filename_lexicon=None, dict_uri=None, dict_phrase=None, dict_lexicon_config=None)[source]

export lexicon to file

Parameters
  • filename_lexicon (str) – file to serialize lexicon to

  • dict_uri (dict) – URI dict structure

  • dict_phrase (dict) – phrase dict structure

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
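
example usage (a minimal sketch; the filename is hypothetical and the lexicon structures follow the shapes documented for import_lexicon() below)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

# a tiny lexicon: dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] }
dict_uri = {'http://example.org/id/part#left+shoulder':
            ['http://example.org/id/part', set(), set()]}
dict_phrase = {'left shoulder': {'http://example.org/id/part#left+shoulder'}}

lexicon_lib.export_lexicon(
    filename_lexicon='my_lexicon.json',
    dict_uri=dict_uri,
    dict_phrase=dict_phrase,
    dict_lexicon_config=dict_config)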

soton_corenlppy.lexico.lexicon_lib.filter_lexicon_wordnet(dict_phrase=None, set_ignore_hyper={}, pos='asrnv', lang='eng', count_freq_min=5, dict_lexicon_config=None)[source]

filter a lexicon to remove all phrases that are WordNet terms with a common frequency. this allows specialist lexicons, with a very specific word sense, to delete phrases that can have many other word senses and thus avoid false positives when using the lexicon. this method will delete entries from dict_phrase.

note: if stemming is applied to the lexicon then there will be no matches for stemmed words (e.g. ‘lotus’ > ‘lotu’, which will not match the WordNet entry ‘lotus’).

Parameters
  • dict_phrase (dict) – phrase dict structure

  • set_ignore_hyper (set) – set of hypernyms (WordNet synset names) whose hyponyms should not be used for filtering (e.g. material.n.01)

  • pos (str) – WordNet POS filter

  • lang (str) – WordNet language

  • count_freq_min (int) – minimum WordNet lemma count frequency below which a WordNet lemma is not used as a filter. this is so only common words are filtered out. set to 0 to filter on any WordNet word

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()
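
example usage (a minimal sketch; the phrase index is a hand-built stand-in and WordNet data must be available to NLTK)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

dict_phrase = {
    'lotus': {'http://example.org/id/plant#lotus'},
    'rebar': {'http://example.org/id/material#rebar'},
}
lexicon_lib.filter_lexicon_wordnet(
    dict_phrase=dict_phrase,
    set_ignore_hyper={'material.n.01'},  # keep anything under this synset
    pos='n',
    count_freq_min=5,
    dict_lexicon_config=dict_config)
# dict_phrase is filtered in place; common WordNet nouns may be deleted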

soton_corenlppy.lexico.lexicon_lib.get_lexicon_config(**kwargs)[source]

return a lexicon config object.

note: a config object approach is used, as opposed to a global variable, to allow lexicon_lib functions to work in a multi-threaded environment

Returns

configuration settings to be used by all lexicon_lib functions

Return type

dict
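
example usage (a minimal sketch of the multi-threaded pattern the note describes; the lexicon filenames are hypothetical)

import threading
from soton_corenlppy.lexico import lexicon_lib

def worker(filename_lexicon):
    # each thread builds its own config object instead of sharing a global
    dict_config = lexicon_lib.get_lexicon_config()
    lexicon_lib.import_lexicon(
        filename_lexicon=filename_lexicon,
        dict_lexicon_config=dict_config)

threads = [threading.Thread(target=worker, args=('lexicon-%d.json' % n,))
           for n in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()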

soton_corenlppy.lexico.lexicon_lib.import_NELL_lexicon(filename_nell=None, lower_case=False, stemmer=None, apply_wordnet_morphy=False, allowed_schema_list=None, dict_lexicon_config=None)[source]

import a lexicon from a tab-delimited NELL KBP CSV file.

Parameters
  • filename_nell (str) – tab delimited file from NELL

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)

  • apply_wordnet_morphy (bool) – if True apply wordnet.morphy() to try and find the base form of a phrase to use in the lexicon

  • allowed_schema_list (list) – filter list of allowed schema values for imported phrases (default None which allows any schema URI)

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

( dict_uri, dict_phrase ), where dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] } and dict_phrase = { phrase : set_uri }. several phrases can share a phrase_uri

Return type

tuple
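
example usage (a minimal sketch; the filename and the NELL category used as a schema filter are hypothetical)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

dict_uri, dict_phrase = lexicon_lib.import_NELL_lexicon(
    filename_nell='NELL.KBP.tsv',
    lower_case=True,
    allowed_schema_list=['concept:food'],  # import only phrases with this schema
    dict_lexicon_config=dict_config)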

soton_corenlppy.lexico.lexicon_lib.import_lexicon(filename_lexicon=None, dict_lexicon_config=None)[source]

import a lexicon from a file serialized using lexicon_lib.export_lexicon()

Parameters
  • filename_lexicon (str) – file to deserialize lexicon from

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

( dict_uri, dict_phrase ), where dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] } and dict_phrase = { phrase : set_uri }. several phrases can share a phrase_uri

Return type

tuple
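
example usage (a minimal sketch; reads back the hypothetical file written in the export_lexicon() example above)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

dict_uri, dict_phrase = lexicon_lib.import_lexicon(
    filename_lexicon='my_lexicon.json',
    dict_lexicon_config=dict_config)
# dict_phrase maps each phrase to its set of URIs, e.g.
# { 'left shoulder' : { 'http://example.org/id/part#left+shoulder' } }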

soton_corenlppy.lexico.lexicon_lib.import_plain_lexicon(filename_lemma=None, list_column_names=['col1', 'schema', 'col3', 'col4', 'phrase_list'], phrase_delimiter='|', lower_case=True, stemmer=None, apply_wordnet_morphy=False, allowed_schema_list=None, dict_lexicon_config=None)[source]

import a lexicon from a tab-delimited CSV file. lines beginning with # are ignored. columns must have a ‘schema’ and a ‘phrase_list’ entry. phrases are delimited by the phrase_delimiter e.g. ‘phrase one | phrase two’. optionally there can also be a ‘hypernym’ entry, which is a space-delimited list of phrase URIs. phrase URIs are constructed from the schema and phrase tokens applied to ASCII urllib.quote_plus() e.g. schema ‘http://example.org/id/part’ + phrase ‘left shoulder’ == ‘http://example.org/id/part#left+shoulder’

Parameters
  • filename_lemma (str) – tab delimited file with schema URI and phrases

  • list_column_names (list) – names of the columns. columns other than ‘schema’, ‘phrase_list’ and ‘hypernym’ are ignored

  • phrase_delimiter (str) – character delimiting phrases in phrase_list

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)

  • apply_wordnet_morphy (bool) – if True apply wordnet.morphy() to try and find the base form of a phrase to use in the lexicon

  • allowed_schema_list (list) – filter list of allowed schema values for imported phrases (default None which allows any schema URI)

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

( dict_uri, dict_phrase ), where dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] } and dict_phrase = { phrase : set_uri }. several phrases can share a phrase_uri

Return type

tuple
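
example usage (a minimal sketch; the file is written inline so the column layout matches the default list_column_names)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

with open('parts.tsv', 'w') as f:
    f.write('# col1\tschema\tcol3\tcol4\tphrase_list\n')
    f.write('1\thttp://example.org/id/part\t-\t-\tleft shoulder | right shoulder\n')

dict_uri, dict_phrase = lexicon_lib.import_plain_lexicon(
    filename_lemma='parts.tsv',
    list_column_names=['col1', 'schema', 'col3', 'col4', 'phrase_list'],
    phrase_delimiter='|',
    dict_lexicon_config=dict_config)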

soton_corenlppy.lexico.lexicon_lib.import_skos_lexicon(filename_lemma=None, filename_hypernym=None, filename_related=None, serialized_format='json', lower_case=True, stemmer=None, apply_wordnet_morphy=False, allowed_schema_list=None, dict_lexicon_config=None)[source]

import a lexicon from files containing serialized results of SPARQL queries over a SKOS vocabulary. files can be serialized in JSON or tab-delimited CSV format. the result is an in-memory index of lexicon terms.

filename_lemma should report SPARQL query variables ?skos_concept ?scheme ?label
filename_hypernym should report SPARQL query variables ?skos_concept ?hypernym
filename_related should report SPARQL query variables ?skos_concept ?related

example query for filename_lemma

SELECT DISTINCT ?skos_concept ?scheme ?label
WHERE {
        ?skos_concept rdf:type skos:Concept .
        OPTIONAL {
                ?skos_concept skos:inScheme ?scheme
        }
        { ?skos_concept skos:prefLabel ?label } UNION { ?skos_concept skos:altLabel ?label }
}
ORDER BY ?skos_concept ?scheme ?label

example query for filename_hypernym

SELECT DISTINCT ?skos_concept ?hypernym
WHERE {
        ?skos_concept rdf:type skos:Concept .
        ?skos_concept skos:broader ?hypernym
}
ORDER BY ?skos_concept ?hypernym

example query for filename_related

SELECT DISTINCT ?skos_concept ?related
WHERE {
        ?skos_concept rdf:type skos:Concept .
        ?skos_concept skos:related ?related
}
ORDER BY ?skos_concept ?related

Parameters
  • filename_lemma (str) – file for lemma SPARQL query result

  • filename_hypernym (str) – file for hypernym SPARQL query result (can be None)

  • filename_related (str) – file for related SPARQL query result (can be None)

  • serialized_format (str) – format of files = json|csv

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)

  • apply_wordnet_morphy (bool) – if True apply wordnet.morphy() to try and find the base form of a phrase to use in the lexicon

  • allowed_schema_list (list) – filter list of allowed schema values for imported phrases (default None which allows any schema URI)

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

( dict_uri, dict_phrase ), where dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] } and dict_phrase = { phrase : set_uri }. several phrases can share a phrase_uri

Return type

tuple
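
example usage (a minimal sketch; the filenames are hypothetical JSON serializations of the SPARQL query results above)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

dict_uri, dict_phrase = lexicon_lib.import_skos_lexicon(
    filename_lemma='skos_lemma.json',
    filename_hypernym='skos_hypernym.json',
    filename_related=None,  # related terms are optional
    serialized_format='json',
    dict_lexicon_config=dict_config)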

soton_corenlppy.lexico.lexicon_lib.load_plain_vocabulary(filename_vocab=None, phrase_delimiter=';', lower_case=True, stemmer=None, apply_wordnet_morphy=False, dict_lexicon_config=None)[source]

import a vocabulary from a plain text file. lines beginning with # are ignored. phrases are delimited by the phrase_delimiter and whitespace at start and end is stripped e.g. ‘phrase one ; phrase two’ -> [‘phrase one’, ‘phrase two’]

Parameters
  • filename_vocab (str) – file with vocabulary

  • phrase_delimiter (str) – character delimiting phrases in phrase_list

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)

  • apply_wordnet_morphy (bool) – if True apply wordnet.morphy() to try and find the base form of a phrase to use in the lexicon

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

list of phrases in vocabulary

Return type

list
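
example usage (a minimal sketch; the vocabulary file is written inline)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

with open('vocab.txt', 'w') as f:
    f.write('# plant vocabulary\n')
    f.write('lotus flower ; water lily\n')

list_phrases = lexicon_lib.load_plain_vocabulary(
    filename_vocab='vocab.txt',
    phrase_delimiter=';',
    dict_lexicon_config=dict_config)
# expected result with the defaults: ['lotus flower', 'water lily']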

soton_corenlppy.lexico.lexicon_lib.load_sparql_query_results(filename_results=None, serialized_format='json', list_variable_names=[], dict_lexicon_config=None)[source]

load and parse a SPARQL query result file. internal function called by lexicon_lib.import_skos_lexicon()

Parameters
  • filename_results (str) – filename to load

  • serialized_format (str) – format of files = json|csv

  • list_variable_names (list) – SPARQL variable names to expect

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

dict of results indexed by the first variable value. a list is kept, as each first variable value might occur more than once. { var1 : [ [var2,var3,…], [var2,var3,…], … ] }

Return type

dict
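
example usage (a minimal sketch; the filename is hypothetical and the variable names match the lemma query shown for import_skos_lexicon())

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

dict_results = lexicon_lib.load_sparql_query_results(
    filename_results='skos_lemma.json',
    serialized_format='json',
    list_variable_names=['skos_concept', 'scheme', 'label'],
    dict_lexicon_config=dict_config)
# shape: { skos_concept : [ [scheme, label], [scheme, label], ... ] }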

soton_corenlppy.lexico.lexicon_lib.merge_lexicon(list_lexicon=None, dict_lexicon_config=None)[source]

merge N lexicons, aggregating each phrase entry. the first schema URI found in the lexicon list (processed top to bottom) is assigned to each phrase URI

Parameters
  • list_lexicon (list) – list of lexicon tuples = [ ( dict_uri, dict_phrase ), … ]

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

( dict_uri, dict_phrase ), where dict_uri = { uri : [ scheme_uri, set_hypernym_uri, set_related_uri ] } and dict_phrase = { phrase : set_uri }. several phrases can share a phrase_uri

Return type

tuple
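
example usage (a minimal sketch; the two input lexicons are hand-built stand-ins for imported lexicons)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

lexicon_a = (
    {'http://example.org/id/part#arm': ['http://example.org/id/part', set(), set()]},
    {'arm': {'http://example.org/id/part#arm'}},
)
lexicon_b = (
    {'http://example.org/id/tool#hammer': ['http://example.org/id/tool', set(), set()]},
    {'hammer': {'http://example.org/id/tool#hammer'}},
)
dict_uri, dict_phrase = lexicon_lib.merge_lexicon(
    list_lexicon=[lexicon_a, lexicon_b],
    dict_lexicon_config=dict_config)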

soton_corenlppy.lexico.lexicon_lib.phrase_lookup(phrase_tokens=None, head_token=None, lex_phrase_index=None, lex_uri_index=None, max_gram=5, stemmer=None, apply_wordnet_morphy=False, hyphen_variant=False, dict_lexicon_config=None)[source]

perform an n-gram lookup of phrases, optionally based around a head token. return the lexicon phrase matches with a confidence score. the confidence score is based on the percentage of tokens in an extracted phrase that match the lexicon phrase.

Parameters
  • phrase_tokens (list) – tokenized phrase to lookup in lexicon

  • head_token (unicode) – head token in phrase which must be in any n-gram lookup. the default None allows all possible n-grams to be looked up.

  • lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • max_gram (int) – maximum phrase gram size to check for matches in lexicon. larger gram sizes mean more lexicon checks, which is slower.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on last phrase token (default is None)

  • apply_wordnet_morphy (bool) – if True apply wordnet.morphy() to the last phrase token

  • hyphen_variant (bool) – if True look up the phrase as it is, and also a version with hyphens replaced by space characters.

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

lexicon matches to phrase = [ ( lexicon_uri, schema_uri, matched_phrase, match_gram_size, confidence_score ) ]

Return type

list
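
example usage (a minimal sketch; the lexicon file is hypothetical and import_lexicon() is assumed to return ( dict_uri, dict_phrase ) in that order)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs
lex_uri_index, lex_phrase_index = lexicon_lib.import_lexicon(
    filename_lexicon='my_lexicon.json',
    dict_lexicon_config=dict_config)

list_matches = lexicon_lib.phrase_lookup(
    phrase_tokens=['sharp', 'pain', 'in', 'left', 'shoulder'],
    head_token='shoulder',  # only n-grams containing this token are checked
    lex_phrase_index=lex_phrase_index,
    lex_uri_index=lex_uri_index,
    max_gram=5,
    dict_lexicon_config=dict_config)
# each match = ( lexicon_uri, schema_uri, matched_phrase, match_gram_size, confidence_score )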

soton_corenlppy.lexico.lexicon_lib.read_noun_type_ranked_list(filename=None, dict_openie_config={})[source]

read a noun type list for use as an allowed_schema_list (see import_skos_lexicon() and other import functions)

Parameters
  • filename (unicode) – filename for ranked list

  • dict_openie_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

list of noun types

Return type

list
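
example usage (a minimal sketch; the filenames are hypothetical and passing the lexicon config as dict_openie_config is an assumption)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs

list_noun_types = lexicon_lib.read_noun_type_ranked_list(
    filename='noun_types.txt',
    dict_openie_config=dict_config)

# use the ranked list to restrict which schema URIs are imported
dict_uri, dict_phrase = lexicon_lib.import_skos_lexicon(
    filename_lemma='skos_lemma.json',
    allowed_schema_list=list_noun_types,
    dict_lexicon_config=dict_config)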

soton_corenlppy.lexico.lexicon_lib.sent_set_lookup(sent_token_set=None, lex_phrase_index=None, lex_uri_index=None, lower_case=True, max_gram=5, stemmer=None, apply_wordnet_morphy=False, hyphen_variant=False, dict_lexicon_config=None)[source]

apply phrase_lookup() to all n-gram phrases in a set of sentences, returning True if any phrase matches the lexicon

Parameters
  • sent_token_set (list) – list of token sets from soton_corenlppy.common_parse_lib.unigram_tokenize_text_with_sent_breakdown()

  • lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lower_case (bool) – if True all sentence tokens will be converted to lower case before lookup. otherwise case is left intact.

  • max_gram (int) – maximum phrase gram size to check for matches in lexicon. larger gram sizes mean more lexicon checks, which is slower.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on last phrase token (default is None)

  • apply_wordnet_morphy (bool) – if True apply wordnet.morphy() to the last phrase token

  • hyphen_variant (bool) – if True look up the phrase as it is, and also a version with hyphens replaced by space characters.

  • dict_lexicon_config (dict) – config object returned from lexicon_lib.get_lexicon_config()

Returns

True if at least one phrase in the sentence set appears in the lexicon

Return type

bool
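
example usage (a minimal sketch; the lexicon file is hypothetical and import_lexicon() is assumed to return ( dict_uri, dict_phrase ) in that order)

from soton_corenlppy.lexico import lexicon_lib

dict_config = lexicon_lib.get_lexicon_config()  # assumption: no mandatory kwargs
lex_uri_index, lex_phrase_index = lexicon_lib.import_lexicon(
    filename_lexicon='my_lexicon.json',
    dict_lexicon_config=dict_config)

sent_token_set = [
    ['the', 'left', 'shoulder', 'ached'],
    ['nothing', 'relevant', 'here'],
]
has_match = lexicon_lib.sent_set_lookup(
    sent_token_set=sent_token_set,
    lex_phrase_index=lex_phrase_index,
    lex_uri_index=lex_uri_index,
    max_gram=5,
    dict_lexicon_config=dict_config)
# True if at least one n-gram phrase in any sentence matches the lexicon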