soton_corenlppy.re.comp_sem_lib module

Compositional semantics library

soton_corenlppy.re.comp_sem_lib.annotate_phrase_sequences(list_sent_tree=None, dict_inverted_index_pos_phrase_patterns=None, dict_openie_config=None)[source]

calc n-gram phrases from sequences of tokens with the same POS tag

Parameters
  • list_sent_tree (list) – list of nltk.Tree representing the sents in a doc = [ nltk.Tree( ‘(S (IN For) (NN handle) … (REF (NP Agora) (NP XXIII) (, ,) (DOC_SECTION pl. 44) (DOC_SECTION no. 448)) …)’ ), … ]

  • dict_inverted_index_pos_phrase_patterns (dict) – inverted index from soton_corenlppy.re.openie_lib.calc_inverted_index_pos_phrase_patterns()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of nltk.Tree representing the sents in that doc after phrase sequencing

Return type

list
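
a minimal usage sketch (variable names are hypothetical; it is assumed here that get_openie_config() and calc_inverted_index_pos_phrase_patterns() can be called with only the arguments shown):

   import soton_corenlppy

   # shared config object used by all comp_sem_lib functions
   dict_openie_config = soton_corenlppy.re.openie_lib.get_openie_config()

   # inverted index of POS phrase patterns (assumed call signature)
   dict_index = soton_corenlppy.re.openie_lib.calc_inverted_index_pos_phrase_patterns(
       dict_openie_config=dict_openie_config )

   # list_sent_tree = list of nltk.Tree sents for one doc, prepared earlier
   list_phrased = soton_corenlppy.re.comp_sem_lib.annotate_phrase_sequences(
       list_sent_tree=list_sent_tree,
       dict_inverted_index_pos_phrase_patterns=dict_index,
       dict_openie_config=dict_openie_config )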

soton_corenlppy.re.comp_sem_lib.annotate_using_pos_patterns(list_sent_trees=None, list_phrase_sequence_patterns_exec_order=['ENTITY', 'ENTITY_LIST', 'ENTITY_AND_CONTEXT', 'RELATION'], dict_phrase_sequence_patterns={'ENTITY': [...], 'ENTITY_AND_CONTEXT': [...], 'ENTITY_LIST': [...], 'RELATION': [...]}, dict_openie_config=None)[source]

Apply a set of POS patterns to a set of tagged sent trees (e.g. for ReVerb arguments and relations, or CH entities, lists of entities, attributes and relations). The result is a sent with POS pattern annotations represented as nltk.Tree elements.

Parameters
  • list_sent_trees (list) – list of nltk.Tree representing the sents in a doc

  • list_phrase_sequence_patterns_exec_order (list) – order that phrase sequence patterns should be executed (usually most permissive last). see openie.list_phrase_sequence_patterns_exec_order_default

  • dict_phrase_sequence_patterns (dict) – phrase sequence patterns of compiled regex extracted groups with same name as pattern = { pattern : [ regex, regex, … ] }. see openie.dict_phrase_sequence_patterns_default for example

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of documents, each containing a list of nltk.Tree objects representing an extracted lexico-pos pattern for entity, attribute and relation phrases

Return type

list
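
a sketch with a single custom pattern set, reusing the first ENTITY regex from the default set; note the named group must match the pattern key:

   import re
   import soton_corenlppy

   # one-pattern set; the full defaults are in openie.dict_phrase_sequence_patterns_default
   dict_patterns = {
       'ENTITY': [
           re.compile(
               '\\A.*?(?P<ENTITY>\\((PROPER_NOUN_P|NOUN_P) [^)]*\\)'
               '( \\((PROPER_NOUN_P|NOUN_P) [^)]*\\)){0,5})',
               re.DOTALL ),
       ],
   }

   list_annotated = soton_corenlppy.re.comp_sem_lib.annotate_using_pos_patterns(
       list_sent_trees=list_phrased,
       list_phrase_sequence_patterns_exec_order=['ENTITY'],
       dict_phrase_sequence_patterns=dict_patterns,
       dict_openie_config=dict_openie_config )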

soton_corenlppy.re.comp_sem_lib.blank_serialized_tree(tree_sent=None, dict_openie_config=None)[source]

create a version of the serialized tree that blanks out all deep structures, just for matching. this allows regex matches to be a lot simpler and not worry about n-deep () structures. the size is identical to the original sent so we can use the match character positions later e.g. original = (S (ATTRIBUTE (NOUN_P handle plates) (PREPOSITION_P of) (NOUN_P column kraters)) (VERB_P attributed) …) blanked = (S (ATTRIBUTE ) (VERB_P ) …)

Parameters
  • tree_sent (nltk.Tree) – nltk.Tree representing a sent = nltk.Tree( ‘(S (IN For) (NN handle) … (REF (NP Agora) (NP XXIII) (, ,) (DOC_SECTION pl. 44) (DOC_SECTION no. 448)) …)’ )

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

serialized tree blanked

Return type

unicode
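
a sketch using the example above (assumes dict_openie_config was created earlier):

   import nltk
   import soton_corenlppy

   tree_sent = nltk.Tree.fromstring(
       '(S (ATTRIBUTE (NOUN_P handle plates) (PREPOSITION_P of) (NOUN_P column kraters)) (VERB_P attributed))' )
   str_blanked = soton_corenlppy.re.comp_sem_lib.blank_serialized_tree(
       tree_sent=tree_sent,
       dict_openie_config=dict_openie_config )
   # str_blanked keeps the top-level labels but empties the nested structure,
   # e.g. '(S (ATTRIBUTE ) (VERB_P ))' padded to the original serialized length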

soton_corenlppy.re.comp_sem_lib.calc_graph_paths_connecting_targets(list_targets=None, dep_graph=None, start_address=0, list_shortest_path=[], longest_dep_path=32, longest_inter_target_walk=2, avoid_dep_set={}, node_branch_index=None, dict_openie_config=None)[source]

calculate a graph walk path that connects all targets. this algorithm uses a fast depth first exploration of dep branches, as opposed to a slower random walk approach. the order of targets found is not important as we want to allow all sequences of targets (e.g. arg rel arg AND rel arg arg ).

Parameters
  • list_targets (list) – list of tuples (address, var_name, var_type) to find on graph walk in sequential order they must be found

  • dep_graph (nltk.parse.DependencyGraph) – dependency graph to walk

  • start_address (int) – root address in graph branch to walk

  • list_shortest_path (list) – shortest path so far

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • node_branch_index (dict) – index of branch addresses under each node (to make walk more efficient by avoiding branches without the target node)

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

shortest path connecting all targets = [ (address, var_name, var_type), … ]

Return type

list
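
a sketch of a target walk (addresses and names are illustrative; node_branch_index comes from construct_node_index(), documented below):

   import soton_corenlppy

   # targets as (address, var_name, var_type) tuples to connect on the graph walk
   list_targets = [ (2, 'arg1', 'arg'), (4, 'rel1', 'rel'), (7, 'arg2', 'arg') ]

   node_index = soton_corenlppy.re.comp_sem_lib.construct_node_index(
       dep_graph=dep_graph,
       dict_openie_config=dict_openie_config )

   list_path = soton_corenlppy.re.comp_sem_lib.calc_graph_paths_connecting_targets(
       list_targets=list_targets,
       dep_graph=dep_graph,
       start_address=0,
       list_shortest_path=[],
       node_branch_index=node_index,
       dict_openie_config=dict_openie_config )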

soton_corenlppy.re.comp_sem_lib.check_for_negation(dep_graph=None, node_address=None, lang='en', dict_assert_true={'en': ['true', 'genuine', 'real', 'confirmed', 'verified']}, dict_assert_false={'en': ['false', 'fake', 'hoax', 'joke', 'trick', 'debunked']}, dict_openie_config=None)[source]

check dependency graph for evidence of negation of a node, or genuine/fake claims with this node as the subject. negation strategy:

  • check neg dep

  • check amod dep leading to a true|false assertion (and then check for a neg dep)

  • check head for [nsubj,nsubjpass] dep originating from a true|false assertion (and then check head for a neg dep)

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • node_address (int) – address is the token index in the dependency graph

  • lang (str) – ISO 639-1 two-character language code e.g. ‘en’, ‘fr’

  • dict_assert_true (dict) – dict of language specific vocabulary for tokens asserting truth. use {} to avoid using a negation vocabulary.

  • dict_assert_false (dict) – dict of language specific vocabulary for tokens asserting falsehood. use {} to avoid using a negation vocabulary.

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

negation status = (negated, genuine) = (true|None, true|false|None)

Return type

tuple
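
a sketch checking one token for negation (the node address is illustrative; dep_graph is a parsed sent from earlier):

   import soton_corenlppy

   (negated, genuine) = soton_corenlppy.re.comp_sem_lib.check_for_negation(
       dep_graph=dep_graph,
       node_address=3,
       lang='en',
       dict_openie_config=dict_openie_config )
   if negated == True:
       print( 'claim is negated' )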

soton_corenlppy.re.comp_sem_lib.collapse_graph_to_make_phrase(dep_graph=None, node_address=None, allowed_dep_set=None, forbidden_address_set={}, variable_head_range=None, search_depth=0, dict_openie_config=None)[source]

collapse a dependency graph and generate a set of text tokens representing a phrase for this node. ensure all tokens appear sequentially around the root node address (e.g. ‘the only other plan’ -> ‘the other plan’ is not allowed, ‘only other plan’ is allowed)

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • node_address (int) – address is the token index in the dependency graph

  • allowed_dep_set (set) – set of allowed dependency types to include in the collapsed phrase. supports lexical constraints (e.g. ‘case:of’) for finer grained filtering. None to allow any dep.

  • forbidden_address_set (set) – set of forbidden dependency graph addresses to avoid in the collapsed phrase

  • variable_head_range (tuple) – tuple (first var head addr, last var head addr) for use with metadata commands (‘not_before_first_var’, ‘not_after_last_var’)

  • search_depth (int) – internal argument, recursion depth

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

success or failure of pattern matching. the reference argument pattern_result will contain the variables matched if this is successful.

Return type

bool
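
a sketch collapsing a noun phrase around one node; the dep type set is illustrative and includes a lexical constraint of the ‘case:of’ form described above:

   import soton_corenlppy

   set_allowed = set( ['amod', 'compound', 'det', 'case:of'] )
   result = soton_corenlppy.re.comp_sem_lib.collapse_graph_to_make_phrase(
       dep_graph=dep_graph,
       node_address=5,
       allowed_dep_set=set_allowed,
       forbidden_address_set=set(),
       dict_openie_config=dict_openie_config )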

soton_corenlppy.re.comp_sem_lib.construct_node_index(dep_graph=None, dict_openie_config=None)[source]

compute a complete list of the addresses under each node. this is important to guide the graph walk later. internal method called by generate_open_extraction_templates()

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – nltk.parse.DependencyGraph object

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

index of each address containing node addresses under it in tree { addr : [ childaddr1, childaddr2 … ] }

Return type

dict

soton_corenlppy.re.comp_sem_lib.construct_seed_addr_options(seed_options=None, seed_graph_address_output=None, dict_openie_config=None)[source]

calculate seed address walks given a set of seed address options in a nested structure. internal method called by generate_open_extraction_templates()

Parameters
  • seed_options (dict) – dict of seed options

  • seed_graph_address_output (list) – output list which will be populated with seed walk options

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.encode_extraction(list_extracted_vars=None, dep_graph=None, set_var_types={}, dict_pretty_dep_rels={'after_head': ['case', 'case:of', 'case:by'], 'any': ['compound', 'amod', 'nummod', 'advmod', 'cop', 'appos', 'dep', 'conj', 'nmod', 'xcomp'], 'before_head': []}, space_replacement_char='_', dict_openie_config=None)[source]

encode an extraction in a serialize format that can be parsed using comp_sem_lib.parse_encoded_extraction()

Parameters
  • list_extracted_vars (list) – list of variables from comp_sem_lib.match_extraction_patterns()

  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • set_var_types (set) – set of {var} types for pretty print

  • dict_pretty_dep_rels (dict) – dict of dep rels to allow in pretty print based on address position relative to head { ‘any’ : [], ‘before_head’ : [], ‘after_head’ : [] }

  • space_replacement_char (str) – replacement char for all token spaces

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

encoded extraction

Return type

unicode
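
a sketch encoding a single match (list_matches is assumed to come from a match_extraction_patterns() call, documented later in this module):

   import soton_corenlppy

   str_encoded = soton_corenlppy.re.comp_sem_lib.encode_extraction(
       list_extracted_vars=list_matches[0],
       dep_graph=dep_graph,
       set_var_types=set( ['arg', 'rel'] ),
       dict_openie_config=dict_openie_config )
   # the encoded string can later be parsed with comp_sem_lib.parse_encoded_extraction()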

soton_corenlppy.re.comp_sem_lib.escape_extraction_pattern(token=None)[source]

escape extraction pattern tokens, replacing | with -ESC_PIPE-, : with -ESC_COLON-, { and } with -ESC_LCB- and -ESC_RCB-, and [ and ] with -ESC_LSB- and -ESC_RSB-. the escape replacement labels are deliberately different from Stanford escaping to avoid conflicts

Parameters

token (unicode) – token to escape

Returns

escaped token

Return type

unicode
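
for example, applying the replacements listed above:

   import soton_corenlppy

   str_esc = soton_corenlppy.re.comp_sem_lib.escape_extraction_pattern( token='nmod:of|{x}' )
   # expected: 'nmod-ESC_COLON-of-ESC_PIPE--ESC_LCB-x-ESC_RCB-'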

soton_corenlppy.re.comp_sem_lib.exec_dep_parser(list_tagged_sents=None, dep_parser_cmd=None, dict_openie_config=None, timeout=300, sigterm_handler=False)[source]

exec a java command line (created from get_dependency_parser()) using popen to run the Stanford dependency parser. pipes are used to avoid any need for file IO

Parameters
  • list_tagged_sents (list) – list of tagged sents from soton_corenlppy.common_parse_lib.pos_tag_tokenset()

  • dep_parser_cmd (list) – list of commands for popen() from get_dependency_parser()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

  • timeout (int) – timeout in seconds for the dep parser process, in the unlikely event the dependency parser hangs

  • sigterm_handler (bool) – if True SIGTERM will be setup to terminate the process handle before exit

Returns

list of nltk.parse.DependencyGraph objects (one per sent)

Return type

list
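
a sketch of the tagged sents to dependency graph pipeline (list_tagged_sents is assumed to come from soton_corenlppy.common_parse_lib.pos_tag_tokenset()):

   import soton_corenlppy

   dep_parser_cmd = soton_corenlppy.re.comp_sem_lib.get_dependency_parser(
       dict_openie_config=dict_openie_config )
   list_dep_graphs = soton_corenlppy.re.comp_sem_lib.exec_dep_parser(
       list_tagged_sents=list_tagged_sents,
       dep_parser_cmd=dep_parser_cmd,
       dict_openie_config=dict_openie_config,
       timeout=300 )
   # one nltk.parse.DependencyGraph per sent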

soton_corenlppy.re.comp_sem_lib.exec_stanford_corenlp(dict_text=None, work_dir=None, annotators='tokenize,ssplit,pos,depparse,lemma,ner', option_list=['-tokenize.options', 'asciiQuotes=true,americanize=false', '-ssplit.eolonly', 'true'], num_processes=6, dict_openie_config=None)[source]

run Stanford CoreNLP to do one or more of the following:
  • Tokenization

  • POS tagging

  • Dependency parsing

  • NER

Parameters
  • dict_text (dict) – dict of text to process, keyed by sent index (the returned dicts use the same sent_index keys)

  • work_dir (str) – working directory CoreNLP can use for temporary files

  • annotators (str) – comma-separated list of CoreNLP annotators to run

  • option_list (list) – list of additional CoreNLP command line options

  • num_processes (int) – number of parallel CoreNLP processes to use

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

tuple of requested information e.g. ( dict_tokens, dict_pos, dict_dep_graph, dict_ner ). all dict use sent_index as the key.

Return type

tuple
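
a sketch (the structure of dict_text is not documented here; it is assumed to map sent indexes to text, matching the sent_index keys of the returned dicts, and the work_dir path is illustrative):

   import soton_corenlppy

   dict_text = { 0: 'the handle is missing .', 1: 'it was attributed to the potter .' }
   ( dict_tokens, dict_pos, dict_dep_graph, dict_ner ) = \
       soton_corenlppy.re.comp_sem_lib.exec_stanford_corenlp(
           dict_text=dict_text,
           work_dir='/tmp/corenlp_work',
           annotators='tokenize,ssplit,pos,depparse,lemma,ner',
           dict_openie_config=dict_openie_config )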

soton_corenlppy.re.comp_sem_lib.extract_annotations_from_sents(list_sent_trees=None, set_annotations={}, pos_tags=False, include_start_and_end_annotations=False, dict_openie_config=None)[source]

extract a set of annotations from a list of sent trees. for example extracting argument and relation annotations from ReVerb style annotated sent trees.

Parameters
  • list_sent_trees (list) – list of nltk.Tree representing the sents in that doc = [ nltk.Tree( ‘(S (IN For) (NN handle) … (REF (NP Agora) (NP XXIII) (, ,) (DOC_SECTION pl. 44) (DOC_SECTION no. 448)) …)’ ), … ]

  • set_annotations (set) – set of allowed annotation labels e.g. set( [‘RELATION’,’ARGUMENT’] )

  • pos_tags (bool) – if True use nltk.Tree.pos() to return tuples (‘token’,’pos’) not just strings ‘token’ after the node label

  • include_start_and_end_annotations (bool) – if True add START and END annotations if the sent start and end immediately precedes or follows a matching annotation

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of sent extractions, each a list of annotation tuples (type, token, …) extracted in the order they appear in the sent e.g. [ [ (‘ARGUMENT’,’John’), (‘RELATION’,’come’,’from’), (‘ARGUMENT’,’Paris’) ], … ]

Return type

list
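
a sketch extracting ReVerb style annotations from annotated sent trees:

   import soton_corenlppy

   list_extractions = soton_corenlppy.re.comp_sem_lib.extract_annotations_from_sents(
       list_sent_trees=list_sent_trees,
       set_annotations=set( ['RELATION', 'ARGUMENT'] ),
       pos_tags=False,
       dict_openie_config=dict_openie_config )
   # e.g. [ [ ('ARGUMENT', 'John'), ('RELATION', 'come', 'from'), ('ARGUMENT', 'Paris') ], ... ]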

soton_corenlppy.re.comp_sem_lib.filter_extractions(dep_graph=None, list_extractions=[], filter_strategy='min_semantic_drift_per_target', use_context=False, max_context=4, min_var_connection=2, max_semantic_drift=4, target_var_type=None, dict_sem_drift={'appos': 2, 'cc': 2, 'conj': 2, 'dep': 0, 'dislocated': 2, 'list': 2, 'parataxis': 2, 'punct': 2, 'remnant': 2}, dict_openie_config=None)[source]

filter a set of extractions for a sent, to get a more focussed set without redundant and overlapping extractions.

filtering strategy - segment_coordinating_conj:
  • idea - ensures extractions cover the parts of a proposition, avoiding over-large propositions and making it easier to extract individual entity attributes (good for knowledge-base production methods)

  • segment sent address range by coordinating conjunction [CC , :] = (addr_start, addr_end)

  • for each address segment find smallest extractions that fully span this address range, favouring highest variable count if multiple options exist

  • if none exist get extractions that cover as many of the individual addresses as possible, favouring extractions whose size best fits the segment size

filtering strategy - segment_subsumption:
  • idea - ensures extractions contain full context, providing large propositions that are easy for humans to understand (good for human intelligence gathering)

  • select extractions whose tokens fully subsume another extraction, favouring highest variable count if multiple options exist

  • head address only used for subsumption checks

filtering strategy - min_semantic_drift_per_target:
  • idea - ensures there is an extraction for each instance of a target var type (e.g. verb-mediated rel in sent)

  • for each target var instance select extractions containing it that have the lowest max inter-var path between variables (not context) and the target var. a preference for widest address range is used to differentiate between extractions with same max inter-var path.

filtering strategy - threshold_semantic_drift_per_target:
  • idea - ensures there is an extraction for each instance of a target var type (e.g. verb-mediated rel in sent)

  • for each target var instance select extractions containing it that have a <= threshold inter-var path between variables (not context) and the target var.

confidence value:
  • the confidence value is appended to the end of each extraction (higher numbers are better)

  • confidence = number of extractions allowed / total number of extractions

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • list_extractions (list) – list of extractions from comp_sem_lib.match_extraction_patterns()

  • filter_strategy (unicode) – name of filter strategy to apply. Can also be None for no filtering.

  • use_context (bool) – use context variables as targets when applying strategy (default is False so only non context variables are considered for intervar distances and address ranges)

  • max_context (int) – max number of context variables to allow per extraction (can be None)

  • min_var_connection (int) – min number of variables connected to target for extraction to be considered (filtered prior to looking at semantic drift) (can be None)

  • max_semantic_drift (int) – max semantic drift between variables (can be None)

  • target_var_type (unicode) – name of target var type (for min_semantic_drift_per_target strategy)

  • dict_sem_drift (dict) – dict of semantic drift costs for dep types e.g. { ‘conj’ : 2 }

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

tuple = (list_extractions_filtered, list_conf). list_extractions_filtered is a filtered list of extractions using same format as comp_sem_lib.match_extraction_patterns(). list_conf is a list of float confidence values per extraction, high value = good

Return type

tuple
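
a sketch using the min_semantic_drift_per_target strategy; the target var type name ‘rel’ is an assumption following the seed to template mapping examples elsewhere in this module:

   import soton_corenlppy

   ( list_filtered, list_conf ) = soton_corenlppy.re.comp_sem_lib.filter_extractions(
       dep_graph=dep_graph,
       list_extractions=list_matches,
       filter_strategy='min_semantic_drift_per_target',
       target_var_type='rel',
       dict_openie_config=dict_openie_config )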

soton_corenlppy.re.comp_sem_lib.filter_open_extraction_templates_using_relevance_feedback(list_parsed_patterns=None, list_doc_set_of_propositions=None, list_relevance_feedback=None, dict_openie_config=None)[source]

filter a set of parsed open pattern templates generated by comp_sem_lib.generate_open_extraction_templates() using relevance feedback. relevance feedback is provided in the form of a list of scored extractions. any template which creates an incorrect extraction will be removed from the provided list.

Parameters
  • list_parsed_patterns (list) – list of parsed open template extraction patterns generated by comp_sem_lib.parse_extraction_pattern()

  • list_doc_set_of_propositions (list) – list of tuples = ( str_index_doc, list_phrases_prop, parsed_pattern_index, list_prop_pattern, list_head_text ). parsed_pattern_index is index of pattern within list_parsed_patterns.

  • list_relevance_feedback (list) – list of tuples = ( str_index_doc, list_phrases_prop, str_score ). a score of 0 is incorrect, a score of 1 is correct.

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.filter_proposition_set(list_proposition_set=None, list_proposition_set_conf=None, target_index=1, filter_strategy='prop_subsumption', dict_index_stoplist_prefix={0: ['and ', 'or ', 'of ', 'to ', 'in ', 'into ', 'for ', 'at ', "'s ", 'by ', 'on ', 'him '], 1: ['and ', 'or '], 2: ['and ', 'or ']}, dict_index_stoplist_suffix={0: [' and', ' or'], 1: [' and', ' or'], 2: [' and', ' or', ' those', ' he', ' she', ' they', ' the', ' a', ' would']}, lex_phrase_index=None, lex_uri_index=None, dict_openie_config=None)[source]

filter a proposition set, selecting the best target. filtered entries are deleted from list_proposition_set

filter strategy:
  • min_length - group propositions at target_index which share a common head address. sort this group by address length (target_index first, then other indexes). select top of list (min length) as single option.

  • max_length - group propositions at target_index which share a common head address. sort this group by address length (target_index first, then other indexes). select bottom of list (max length) as single option.

  • prop_subsumption - remove any n-gram proposition that is subsumed by a higher gram proposition.

  • lexicon_filter - remove any n-gram proposition which does not have at least 1 variable with a lexicon phrase match (unigrams, bigrams and trigrams are checked with morphy).

Parameters
  • list_proposition_set (list) – list of tuples obtained from calls to comp_sem_lib.generate_proposition_set_from_extraction(). filtered entries will be removed from this set.

  • list_proposition_set_conf (list) – list of confidence values associated with each proposition. filtered entries will be removed from this set.

  • target_index (int) – proposition index of target variable to base filtering on (e.g. index of rel)

  • filter_strategy (str) – filter strategy = min_length|max_length|prop_subsumption|lexicon_filter

  • dict_index_stoplist_prefix (dict) – dict of prefixes for each index to disallow e.g. ‘of ‘ on first argument of {arg,rel,arg}

  • dict_index_stoplist_suffix (dict) – dict of suffixes for each index to disallow e.g. ‘ and’ on rel argument of {arg,rel,arg}

  • lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.generate_open_extraction_templates(seed_tuples=None, var_candidates=None, corpus_sent_graphs=None, dict_seed_to_template_mappings={}, dict_context_dep_types=[], longest_dep_path=32, longest_inter_target_walk=2, max_seed_variants=128, allow_seed_subsumption=True, avoid_dep_set={}, space_replacement_char='_', dict_openie_config=None)[source]

generate a set of open pattern templates based on a training corpus (dependency parsed into graphs) and seed_tuples with known ‘high quality’ argument and relation groups.

for each seed_tuple generate specific open pattern templates:
  • for each sent, get all possible combinations of seed tokens where the seed tokens appear in the same sequential order as the seed tuple

  • remove any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch). subsumed seeds will be included later when branches are collapsed.

  • for each combination of seed tokens, compute the shortest dependency graph path which contains all seed tuple words

  • reject dependency path lengths > threshold, as too verbose and unlikely to express the true original tuple’s meaning

  • walk the dependency path and generate very specific open pattern templates (lexical and POS constraints)

Parameters
  • seed_tuples (list) – list (or set) of seed_tuples from comp_sem_lib.generate_seed_tuples()

  • var_candidates (dict) – dict of seed tuple variable types, each containing a list of phrases that are var candidates, from comp_sem_lib.generate_seed_tuples()

  • corpus_sent_graphs (list) – list of nltk.parse.DependencyGraph objects for a corpus of sents

  • dict_seed_to_template_mappings (dict) – dict of mappings from seed_tuple type names (e.g. ‘ARGUMENT’) to open extraction template types (e.g. ‘arg’). template type must not contain a ‘_’ character.

  • dict_context_dep_types (dict) – dict of contextual dependency types that are to be added if not already on graph path (e.g. neg)

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • max_seed_variants (int) – max number of seed variants possible for an individual sent graph. seed variants are created by matching seed tokens to sent graph tokens, and exploding the combinations so all possibilities are checked. if a seed phrase contains tokens that appear many times in a sent, the combinations could get large. this setting provides an upper limit to ensure for these unusual cases processing time is not excessive.

  • allow_seed_subsumption (bool) – if True removes any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch)

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • space_replacement_char (str) – replacement char for all token spaces as dep graph cannot have a space. should be same as prepare_tags_for_dependency_parse()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of open template extraction strings ready for parsing using comp_sem_lib.parse_extraction_pattern()

Return type

list
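
a sketch of the template learning step, wiring seeds from generate_seed_tuples() into the graph walk; the ‘ARGUMENT’/‘RELATION’ to ‘arg’/‘rel’ mapping follows the examples above:

   import soton_corenlppy

   dict_mappings = { 'ARGUMENT': 'arg', 'RELATION': 'rel' }
   list_templates = soton_corenlppy.re.comp_sem_lib.generate_open_extraction_templates(
       seed_tuples=list_seed_tuples,
       var_candidates=var_candidates,
       corpus_sent_graphs=list_dep_graphs,
       dict_seed_to_template_mappings=dict_mappings,
       dict_openie_config=dict_openie_config )
   # each template string is then parsed with comp_sem_lib.parse_extraction_pattern()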

soton_corenlppy.re.comp_sem_lib.generate_open_extraction_templates_batch(seed_tuples=None, var_candidates=None, dict_document_sent_graphs={}, dict_seed_to_template_mappings={}, dict_context_dep_types=[], max_processes=4, longest_dep_path=32, longest_inter_target_walk=2, max_seed_variants=128, allow_seed_subsumption=True, avoid_dep_set={}, dict_openie_config=None)[source]

generate a set of open pattern templates based on a training corpus (dependency parsed into graphs) and seed_tuples with known ‘high quality’ argument and relation groups. use multiprocess spawning to maximize CPU usage as this is a slow, CPU-intensive process.

see comp_sem_lib.generate_open_extraction_templates() for details

Parameters
  • seed_tuples (list) – list (or set) of seed_tuples from comp_sem_lib.generate_seed_tuples()

  • var_candidates (dict) – dict of seed tuple variable types, each value containing a list of phrase tuples that are var candidates, from comp_sem_lib.generate_seed_tuples()

  • dict_document_sent_graphs (dict) – dict of document ID keys, each value being a list of nltk.parse.DependencyGraph objects for a corpus of sents

  • dict_seed_to_template_mappings (dict) – dict of mappings from seed_tuple type names (e.g. ‘ARGUMENT’) to open extraction template types (e.g. ‘arg’)

  • dict_context_dep_types (dict) – dict of contextual dependency types that are to be added if not already on graph path (e.g. neg)

  • max_processes (int) – number of worker processes to spawn using multiprocessing.Process

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • max_seed_variants (int) – max number of seed variants possible for an individual sent graph. seed variants are created by matching seed tokens to sent graph tokens, and exploding the combinations so all possibilities are checked. if a seed phrase contains tokens that appear many times in a sent, the combinations could get large. this setting provides an upper limit to ensure for these unusual cases processing time is not excessive.

  • allow_seed_subsumption (bool) – if True removes any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch)

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of open template extraction strings ready for parsing using comp_sem_lib.parse_extraction_pattern()

Return type

list

soton_corenlppy.re.comp_sem_lib.generate_open_extraction_templates_worker(seed_tuples=None, var_candidates=None, tuple_queue=None, dict_seed_to_template_mappings={}, dict_context_dep_types=[], longest_dep_path=32, longest_inter_target_walk=2, max_seed_variants=128, allow_seed_subsumption=True, avoid_dep_set={}, pause_on_start=0, process_id=0, dict_openie_config=None)[source]

worker thread for comp_sem_lib.generate_open_extraction_templates_batch()

Parameters
  • seed_tuples (list) – list (or set) of seed_tuples from comp_sem_lib.generate_seed_tuples()

  • var_candidates (dict) – dict of seed tuple variable types, each containing a list of phrases that are var candidates, from comp_sem_lib.generate_seed_tuples()

  • tuple_queue (tuple) – tuple of queues (queueIn, queueOut, queueError). queueIn has serialized nltk.parse.DependencyGraph objects. queueOut has a list of template patterns for each graph.

  • dict_seed_to_template_mappings (dict) – dict of mappings from seed_tuple type names (e.g. ‘ARGUMENT’) to open extraction template types (e.g. ‘arg’)

  • dict_context_dep_types (dict) – dict of contextual dependency types that are to be added if not already on graph path (e.g. neg)

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • max_seed_variants (int) – max number of seed variants possible for an individual sent graph. seed variants are created by matching seed tokens to sent graph tokens, and exploding the combinations so all possibilities are checked. if a seed phrase contains tokens that appear many times in a sent, the combinations could get large. this setting provides an upper limit to ensure for these unusual cases processing time is not excessive.

  • allow_seed_subsumption (bool) – if True removes any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch)

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • pause_on_start (int) – number of seconds to delay thread startup before CPU intensive work begins (to allow other workers to start up also)

  • process_id (int) – process ID for logging

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.generate_proposition_set_from_extraction(list_extracted_vars=None, dep_graph=None, proposition_pattern=['arg', 'rel', 'arg'], dict_displaced_context={'arg': [], 'rel': ['nsubj', 'nsubjpass', 'dobj', 'iobj', 'csubj', 'csubjpass', 'xcomp', 'nmod', 'nmod:*', 'advcl', 'advcl:*', 'neg', 'nfincl', 'nfincl:*', 'ncmod', 'ncmod:*', 'acl', 'acl:*', 'vocative', 'discourse', 'expl', 'aux', 'auxpass', 'cop', 'mark', 'punct', 'nummod', 'appos', 'nmod', 'nmod:*', 'relcl', 'nfincl', 'nfincl:*', 'ncmod', 'ncmod:*', 'amod', 'det', 'neg', 'compound', 'compound:*', 'name', 'mwe', 'foreign', 'goeswith', 'conj', 'cc', 'dep']}, max_semantic_dist=None, include_context=True, space_replacement_char='_', dict_sem_drift={'appos': 2, 'cc': 2, 'conj': 2, 'dep': 0, 'dislocated': 2, 'list': 2, 'parataxis': 2, 'punct': 2, 'remnant': 2}, dict_openie_config=None)[source]

generate a proposition set from a set of extracted variables from comp_sem_lib.match_extraction_patterns().

Parameters
  • list_extracted_vars (list) – list of variables from comp_sem_lib.match_extraction_patterns()

  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • proposition_pattern (list) – sequence of variable types to use to make a propositional expression

  • max_semantic_dist (int) – max allowed semantic distance between vars in a proposition (can be None)

  • include_context (bool) – include context variables within proposition phrases (these may be displaced from the target variables in proposition_pattern)

  • space_replacement_char (str) – replacement char for all token spaces. should be same as prepare_tags_for_dependency_parse()

  • dict_sem_drift (dict) – dict of semantic drift costs for dep types e.g. { ‘conj’ : 2 }

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of tuples = (prop_phrase,prop_head,prop_addr_list,head_addr_list,pattern_index,proposition_pattern) or None. prop_phrase = propositional expression from extraction as defined by requested pattern [arg_phrase, rel_phrase, arg_phrase]. prop_head = propositional expression’s head tokens [arg_head, rel_head, arg_head]. pattern_index = index of original extraction pattern that generated this proposition.

Return type

list
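
a sketch turning one matched extraction into (arg, rel, arg) propositions:

   import soton_corenlppy

   list_props = soton_corenlppy.re.comp_sem_lib.generate_proposition_set_from_extraction(
       list_extracted_vars=list_matches[0],
       dep_graph=dep_graph,
       proposition_pattern=['arg', 'rel', 'arg'],
       dict_openie_config=dict_openie_config )
   # each tuple = (prop_phrase, prop_head, prop_addr_list, head_addr_list,
   #               pattern_index, proposition_pattern)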

soton_corenlppy.re.comp_sem_lib.generate_seed_tuples(list_sent_trees=None, generation_strategy='contiguous_tuple', set_annotations=None, dict_annotation_phrase_patterns={}, list_sequences=None, prevent_sequential_instances=None, lower_case=False, stemmer=None, dict_openie_config=None)[source]

generate seed tuples and var candidates ready for generate_open_extraction_templates() using a number of strategies. a seed tuple is a sequence of annotations suitable for graph walk targets, e.g. [ (‘ARGUMENT’,’London’,’bridge’), (‘RELATION’,’is’, ‘burning’), (‘ARGUMENT’,’down’) ]. a var candidate is a sub-phrase (noun, verb or pronoun phrase) appearing in an annotation, e.g. { ‘ARGUMENT’ : [‘London’, …] }

generation strategies:
  • predefined_sequences - explicitly allowed sent annotation sequences e.g. (arg rel arg), (arg rel arg rel arg)

  • contiguous_tuple - all possible sent annotation tuple (up to quad) contiguous sequence combinations e.g. (arg, rel, arg)

  • contiguous_tuple_candidates - all possible var candidate tuple (up to quad) contiguous sequence combinations e.g. (arg, rel, arg)

  • contiguous_tuple_with_seq_groups - all possible sent annotation tuple (up to quad), which meet both a contiguous and sequence group criteria e.g. (arg, (rel,arg))

var candidate generation strategy:
  • allow any pronoun, noun and verb phrase appearing within an annotation

returns a set of seed tuples that a later dependency graph walk can use as target vars, and a set of phrases for possible intermediate var candidates, so context var types can be replaced with known var types. note that ,:; characters are removed from generated seed tuples as they do not appear in a dep graph (so a tuple with them would never match). note that START and END are special labels indicating the start or end of the sent should be matched

Parameters
  • list_sent_trees (list) – list of nltk.Tree representing the sents in that doc = [ nltk.Tree( ‘(S (DT the) (ARGUMENT (NN handle)) (RELATION (VBZ missing)) … (REF (NP Agora) (NP XXIII) (, ,) (DOC_SECTION pl. 44) (DOC_SECTION no. 448)) …)’ ), … ]

  • generation_strategy (str) – name of generation strategy = predefined_sequences|contiguous_tuple|contiguous_tuple_candidates|contiguous_tuple_with_seq_groups

  • set_annotations (set) – filter set of annotation labels to create sequences for e.g. set( [‘RELATION’,’ARGUMENT’] ) - [predefined_sequences, contiguous_triples]

  • dict_annotation_phrase_patterns (dict) – allowed phrase patterns for each variable type e.g. { ‘ARGUMENT’ : [‘noun_phrase’,’pronoun’], ‘RELATION’ : [‘verb_phrase’] }

  • list_sequences (list) – list of sequences to allow as seed_tuples, including special START and END labels. for predefined_sequences it’s the set of predefined sequences e.g. [ (‘ARGUMENT’,’RELATION’,’ARGUMENT’), (‘ARGUMENT’,’RELATION’) ]. for contiguous_tuple it’s the tuple pattern (up to quad) to generate e.g. [ ‘ARGUMENT’,’RELATION’,’ARGUMENT’ ] - [predefined_sequences, contiguous_triples]

  • prevent_sequential_instances (list) – list of seed types to prevent sequential instance matching e.g. [‘PREPOSITION’] - [predefined_sequences]

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on last phrase token (default is None)

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

tuple of (list_seed_tuples, var_candidates). list_seed_tuples = [ ( (‘ARGUMENT’,’John’), (‘RELATION’,’come’,’from’), (‘ARGUMENT’,’Paris’) ), … ]. var_candidates = { ‘ARGUMENT’ : [ (‘John’,’Barnes’), (‘Pele’,) ] }

Return type

tuple
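
a sketch generating contiguous (arg, rel, arg) seed tuples from annotated sent trees:

   import soton_corenlppy

   ( list_seed_tuples, var_candidates ) = soton_corenlppy.re.comp_sem_lib.generate_seed_tuples(
       list_sent_trees=list_sent_trees,
       generation_strategy='contiguous_tuple',
       set_annotations=set( ['ARGUMENT', 'RELATION'] ),
       list_sequences=[ 'ARGUMENT', 'RELATION', 'ARGUMENT' ],
       dict_openie_config=dict_openie_config )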

soton_corenlppy.re.comp_sem_lib.generate_seeds_and_templates_batch(dict_document_sent_trees={}, generation_strategy='contiguous_tuple', seed_filter_strategy='premissive', set_annotations=None, dict_annotation_phrase_patterns={}, list_sequences=None, prevent_sequential_instances=None, lower_case=False, stemmer=None, lex_phrase_index=None, lex_uri_index=None, dict_document_sent_graphs={}, dict_seed_to_template_mappings={}, dict_context_dep_types=[], max_processes=4, longest_dep_path=32, longest_inter_target_walk=2, max_seed_variants=128, allow_seed_subsumption=True, avoid_dep_set={}, dict_openie_config=None)[source]

aggregate function. this will call generate_seed_tuples() and then generate_open_extraction_templates_batch() per document rather than per corpus. this is a lot faster as the seed search space is constrained to document level not corpus level, but might reject some useful seeds not found in the original POS seed patterns.

Parameters
  • dict_document_sent_trees (dict) – dict of document ID keys, each value being a list of nltk.Tree representing the sents in that doc = [ nltk.Tree( ‘(S (DT the) (ARGUMENT (NN handle)) (RELATION (VBZ missing)) … (REF (NP Agora) (NP XXIII) (, ,) (DOC_SECTION pl. 44) (DOC_SECTION no. 448)) …)’ ), … ]

  • generation_strategy (str) – name of generation strategy = predefined_sequences|contiguous_tuple|contiguous_tuple_candidates|contiguous_tuple_with_seq_groups

  • set_annotations (set) – filter set of annotation labels to create sequences for e.g. set( [‘RELATION’,’ARGUMENT’] ) - [predefined_sequences, contiguous_triples]

  • dict_annotation_phrase_patterns (dict) – allowed phrase patterns for each variable type e.g. { ‘ARGUMENT’ : [‘noun_phrase’,’pronoun’], ‘RELATION’ : [‘verb_phrase’] }

  • list_sequences (list) – list of sequences to allow as seed_tuples, including special START and END labels. for predefined_sequences it’s the set of predefined sequences e.g. [ (‘ARGUMENT’,’RELATION’,’ARGUMENT’), (‘ARGUMENT’,’RELATION’) ]. for contiguous_tuple it’s the tuple pattern (up to quad) to generate e.g. [ ‘ARGUMENT’,’RELATION’,’ARGUMENT’ ] - [predefined_sequences, contiguous_triples]

  • prevent_sequential_instances (list) – list of seed types to prevent sequential instance matching e.g. [‘PREPOSITION’] - [predefined_sequences]

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on last phrase token (default is None)

  • lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • dict_document_sent_graphs (dict) – dict of document ID keys, each value being a list of nltk.parse.DependencyGraph objects for a corpus of sents

  • dict_seed_to_template_mappings (dict) – dict of mappings from seed_tuple type names (e.g. ‘ARGUMENT’) to open extraction template types (e.g. ‘arg’)

  • dict_context_dep_types (dict) – dict of contextual dependency types that are to be added if not already on graph path (e.g. neg)

  • max_processes (int) – number of worker processes to spawn using multiprocessing.Process

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • max_seed_variants (int) – max number of seed variants possible for an individual sent graph. seed variants are created by matching seed tokens to sent graph tokens, and exploding the combinations so all possibilities are checked. if a seed phrase contains tokens that appear many times in a sent, the combinations could get large. this setting provides an upper limit to ensure for these unusual cases processing time is not excessive.

  • allow_seed_subsumption (bool) – if True removes any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch)

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of open template extraction strings ready for parsing using comp_sem_lib.parse_extraction_pattern()

Return type

list

soton_corenlppy.re.comp_sem_lib.generate_seeds_and_templates_worker(tuple_queue=None, generation_strategy='contiguous_tuple', seed_filter_strategy='premissive', set_annotations=None, dict_annotation_phrase_patterns={}, list_sequences=None, prevent_sequential_instances=None, lower_case=False, stemmer=None, lex_phrase_index=None, lex_uri_index=None, dict_seed_to_template_mappings={}, dict_context_dep_types=[], longest_dep_path=32, longest_inter_target_walk=2, max_seed_variants=128, allow_seed_subsumption=True, avoid_dep_set={}, pause_on_start=0, process_id=0, dict_openie_config=None)[source]

worker thread for comp_sem_lib.generate_seeds_and_templates_batch()

Parameters
  • tuple_queue (tuple) – tuple of queues (queueIn, queueOut, queueError). queueIn has serialized nltk.parse.DependencyGraph objects. queueOut has a list of template patterns for each graph.

  • generation_strategy (str) – name of generation strategy = predefined_sequences|contiguous_tuple|contiguous_tuple_candidates|contiguous_tuple_with_seq_groups

  • set_annotations (set) – filter set of annotation labels to create sequences for e.g. set( [‘RELATION’,’ARGUMENT’] ) - [predefined_sequences, contiguous_triples]

  • dict_annotation_phrase_patterns (dict) – allowed phrase patterns for each variable type e.g. { ‘ARGUMENT’ : [‘noun_phrase’,’pronoun’], ‘RELATION’ : [‘verb_phrase’] }

  • list_sequences (list) – list of sequences to allow as seed_tuples, including special START and END labels. for predefined_sequences it’s the set of predefined sequences e.g. [ (‘ARGUMENT’,’RELATION’,’ARGUMENT’), (‘ARGUMENT’,’RELATION’) ]. for contiguous_tuple it’s the tuple pattern (up to quad) to generate e.g. [ ‘ARGUMENT’,’RELATION’,’ARGUMENT’ ] - [predefined_sequences, contiguous_triples]

  • prevent_sequential_instances (list) – list of seed types to prevent sequential instance matching e.g. [‘PREPOSITION’] - [predefined_sequences]

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on last phrase token (default is None)

  • lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • dict_seed_to_template_mappings (dict) – dict of mappings from seed_tuple type names (e.g. ‘ARGUMENT’) to open extraction template types (e.g. ‘arg’)

  • dict_context_dep_types (dict) – dict of contextual dependency types that are to be added if not already on graph path (e.g. neg)

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • max_seed_variants (int) – max number of seed variants possible for an individual sent graph. seed variants are created by matching seed tokens to sent graph tokens, and exploding the combinations so all possibilities are checked. if a seed phrase contains tokens that appear many times in a sent, the combinations could get large. this setting provides an upper limit to ensure for these unusual cases processing time is not excessive.

  • allow_seed_subsumption (bool) – if True removes any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch)

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • pause_on_start (int) – number of seconds to delay thread startup before CPU intensive work begins (to allow other workers to start up also)

  • process_id (int) – process ID for logging

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.generate_templates_from_predefined_seeds_batch(dict_document_sent_trees={}, dict_document_seed_tuples={}, dict_document_var_candidates={}, seed_filter_strategy='premissive', lower_case=False, stemmer=None, lex_phrase_index=None, lex_uri_index=None, dict_document_sent_graphs={}, dict_seed_to_template_mappings={}, dict_context_dep_types=[], max_processes=4, longest_dep_path=32, longest_inter_target_walk=2, max_seed_variants=128, allow_seed_subsumption=True, avoid_dep_set={}, dict_openie_config=None)[source]

aggregate function using pre-loaded seed tuples

Parameters
  • dict_document_sent_trees (dict) – dict of document ID keys, each value being a list of nltk.Tree representing the sents in that doc = [ nltk.Tree( ‘(S (DT the) (ARGUMENT (NN handle)) (RELATION (VBZ missing)) … (REF (NP Agora) (NP XXIII) (, ,) (DOC_SECTION pl. 44) (DOC_SECTION no. 448)) …)’ ), … ]

  • dict_document_seed_tuples (dict) – dict of seed tuples for each doc created from generate_seed_tuples()

  • dict_document_var_candidates (dict) – dict of var candidates for each doc from generate_seed_tuples()

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on last phrase token (default is None)

  • lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • dict_document_sent_graphs (dict) – dict of document ID keys, each value being a list of nltk.parse.DependencyGraph objects for a corpus of sents

  • dict_seed_to_template_mappings (dict) – dict of mappings from seed_tuple type names (e.g. ‘ARGUMENT’) to open extraction template types (e.g. ‘arg’)

  • dict_context_dep_types (dict) – dict of contextual dependency types that are to be added if not already on graph path (e.g. neg)

  • max_processes (int) – number of worker processes to spawn using multiprocessing.Process

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • max_seed_variants (int) – max number of seed variants possible for an individual sent graph. seed variants are created by matching seed tokens to sent graph tokens, and exploding the combinations so all possibilities are checked. if a seed phrase contains tokens that appear many times in a sent, the combinations could get large. this setting provides an upper limit to ensure for these unusual cases processing time is not excessive.

  • allow_seed_subsumption (bool) – if True removes any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch)

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of open template extraction strings ready for parsing using comp_sem_lib.parse_extraction_pattern()

Return type

list

soton_corenlppy.re.comp_sem_lib.generate_templates_from_predefined_seeds_worker(tuple_queue=None, seed_filter_strategy='premissive', lower_case=False, stemmer=None, lex_phrase_index=None, lex_uri_index=None, dict_seed_to_template_mappings={}, dict_context_dep_types=[], longest_dep_path=32, longest_inter_target_walk=2, max_seed_variants=128, allow_seed_subsumption=True, avoid_dep_set={}, pause_on_start=0, process_id=0, dict_openie_config=None)[source]

worker thread for comp_sem_lib.generate_templates_from_predefined_seeds_batch()

Parameters
  • tuple_queue (tuple) – tuple of queues (queueIn, queueOut, queueError). queueIn has serialized nltk.parse.DependencyGraph objects. queueOut has a list of template patterns for each graph.

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on last phrase token (default is None)

  • lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • dict_seed_to_template_mappings (dict) – dict of mappings from seed_tuple type names (e.g. ‘ARGUMENT’) to open extraction template types (e.g. ‘arg’)

  • dict_context_dep_types (dict) – dict of contextual dependency types that are to be added if not already on graph path (e.g. neg)

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • max_seed_variants (int) – max number of seed variants possible for an individual sent graph. seed variants are created by matching seed tokens to sent graph tokens, and exploding the combinations so all possibilities are checked. if a seed phrase contains tokens that appear many times in a sent, the combinations could get large. this setting provides an upper limit to ensure for these unusual cases processing time is not excessive.

  • allow_seed_subsumption (bool) – if True removes any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch)

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • pause_on_start (int) – number of seconds to delay thread startup before CPU intensive work begins (to allow other workers to start up also)

  • process_id (int) – process ID for logging

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.get_dep_tree_addresses(dep_graph=None, branch_address=None, list_address_set=None, dict_openie_config=None)[source]

return a list of all address nodes under a branch node in a dep graph

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – dependency graph to process

  • branch_address (int) – address of branch to process

  • list_address_set (list) – result list of addresses (list will be populated by function)

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.get_dependency_parser(dict_openie_config=None, dep_options=['-tokenized', '-tagSeparator', '/', '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer', '-tokenizerMethod', 'newCoreLabelTokenizerFactory', '-maxLength', '200'])[source]

return a java command line for popen to run the Stanford dependency parser using exec_dep_parser(). the command omits an input text filename, so tagged text is assumed to be provided via STDIN. default options limit the parser to 200 words to avoid reported dep parser hangs when processing overly long sentences.

Parameters
  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

  • dep_options (list) – list of options for stanford dep parser

Returns

java command line for popen()

Return type

list

soton_corenlppy.re.comp_sem_lib.get_extraction_vars(list_extracted_vars=None, dict_openie_config=None)[source]

return a list of extraction variable names and types from an extraction

Parameters
  • list_extracted_vars (list) – list of variables from comp_sem_lib.match_extraction_patterns()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of tuple = ( variable_name, variable_type )

Return type

list

soton_corenlppy.re.comp_sem_lib.get_variables_connections(check_index=None, list_variables=None, dep_graph=None, str_connection_path='', recurse_address=None, dict_openie_config=None)[source]

get dependency graph connections from this variable (any addresses in collapsed set) to any other variables (any addresses in collapsed set)

Parameters
  • check_index (int) – index of variable to check

  • list_variables (list) – list of variables in extraction to try to connect

  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • str_connection_path (str) – internal argument, connection path walked so far

  • recurse_address (int) – internal argument, graph address to recurse from

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

{ ‘dep>nsubj’ : set([‘arg1’]), … }

Return type

dict

soton_corenlppy.re.comp_sem_lib.index_cross_variable_connections(list_variables=None, dict_sem_drift={'appos': 2, 'cc': 2, 'conj': 2, 'dep': 0, 'dislocated': 2, 'list': 2, 'parataxis': 2, 'punct': 2, 'remnant': 2}, dict_openie_config=None)[source]

compute an index of how each variable connects to other variables within a specific extraction. direct connections between variables are indexed first, where one variable can be walked via the dep graph to another directly. indirect N-deep connections between variables are also indexed, where one variable can be walked via the dep graph to another directly OR via an intermediate variable. variable bases are used to map connections, not individual variables, so multi-token variables will not be connected multiple times i.e. ‘arg1’ not ‘arg1_2’

the number of dep graph walk steps between variables is recorded as a kind of proxy for semantic drift along the graph walk between extracted variables. each coordinating conjunction, loose joining relation and appositional modifier adds an extra 2 to the step count.

Index types:
  • index_direct_connect - base variable inter-connections

  • index_any_connect - compute for each variable the other variables it has a connection to (including via context and other intermediate variables)

  • index_context_connect - compute for each variable the other variables it has a connection to ONLY following context links

Parameters
  • list_variables (list) – list of variables in extraction

  • dict_sem_drift (dict) – dict of semantic drift costs for dep types e.g. { ‘conj’ : 2 }

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

tuple = ( index_direct_connect, index_any_connect, index_context_connect ). index_direct_connect = { var_base : set([ ( direct_var_connection, walk_steps ), … ]) }. index_any_connect = { var_base : set([ ( any_var_connection, walk_steps ) ]) }

Return type

tuple

soton_corenlppy.re.comp_sem_lib.kill_dep_parser()[source]

SIGTERM handler for exec_dep_parser() to ensure the Stanford dep parser process is terminated. otherwise it will wait forever for more text to arrive via STDIN
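
a minimal sketch of registering the handler (since the signature above takes no arguments, a wrapper lambda is assumed to be needed to match the (signum, frame) handler signature):

    import signal
    import soton_corenlppy.re.comp_sem_lib as comp_sem_lib

    # ensure the dep parser subprocess is terminated when this process gets SIGTERM
    signal.signal( signal.SIGTERM, lambda signum, frame : comp_sem_lib.kill_dep_parser() )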

soton_corenlppy.re.comp_sem_lib.match_extraction_patterns(dep_graph=None, list_extraction_patterns=[], dict_collapse_dep_types={}, dict_assert_true={'en': ['true', 'genuine', 'real', 'confirmed', 'verified']}, dict_assert_false={'en': ['false', 'fake', 'hoax', 'joke', 'trick', 'debunked']}, dict_openie_config=None)[source]

apply a set of open pattern templates, parsed using comp_sem_lib.parse_extraction_pattern(), to a dependency graph. the patterns are executed in the order they appear in the list, and all possible matches are returned (a pattern could be materialized in several ways if, for example, there are multiple dep options). the result of a match is a set of matched variables from the open pattern template, including for each a graph_address (i.e. token position) and collapsed phrase (i.e. dependency graph collapsed to produce a text phrase for the variable).

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • list_extraction_patterns (list) – list of parsed open pattern templates from comp_sem_lib.parse_extraction_pattern()

  • dict_collapse_dep_types (dict) – dependency graph types to use when collapsing variable branches = { ‘var_type’ : set([dep_type, dep_type, …]), … }

  • dict_assert_true (dict) – dict of language specific vocabulary for tokens asserting truth used for handling negation. use {} to avoid using a negation vocabulary.

  • dict_assert_false (dict) – dict of language specific vocabulary for tokens asserting falsehood used for handling negation. use {} to avoid using a negation vocabulary.

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of matches, each with a list of variables from the matched pattern, or [] if no match occurred = [ [ ( var_type, var_name, graph_address, collapsed_graph_addresses[], dictConnection{}, pattern_index ), … ], … ]. dictConnection is obtained from get_variables_connections()

Return type

list
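
a minimal sketch of matching a single dependency graph (assuming dep_graph comes from parse_sent_trees(), and using the first pattern example documented under parse_extraction_pattern() below):

    import soton_corenlppy.re.comp_sem_lib as comp_sem_lib

    # parse a serialized open pattern template into an executable form
    list_patterns = [
        comp_sem_lib.parse_extraction_pattern(
            str_pattern = u'{arg1_1:} <nsubj< {rel1_1:pos=VBD} >dobj> {arg2_1:}',
            dict_openie_config = dict_openie_config )
        ]

    # apply the templates; each match is a list of ( var_type, var_name, ... ) tuples
    list_matches = comp_sem_lib.match_extraction_patterns(
        dep_graph = dep_graph,
        list_extraction_patterns = list_patterns,
        dict_openie_config = dict_openie_config )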

soton_corenlppy.re.comp_sem_lib.match_extraction_patterns_batch(dict_document_sent_graphs={}, list_extraction_patterns=[], dict_collapse_dep_types={}, dict_assert_true={'en': ['true', 'genuine', 'real', 'confirmed', 'verified']}, dict_assert_false={'en': ['false', 'fake', 'hoax', 'joke', 'trick', 'debunked']}, max_processes=4, dict_openie_config=None)[source]

apply a set of open pattern templates, parsed using comp_sem_lib.parse_extraction_pattern(), to a batch of dependency graphs. multiprocess spawning is used to maximize CPU usage, as pattern matching is slow and CPU intensive.

the patterns are executed in the order they appear in the list. the result of a match is a set of matched variables from the open pattern template, including for each a graph_address (i.e. token position) and collapsed phrase (i.e. dependency graph collapsed to produce a text phrase for the variable).

Parameters
  • dict_document_sent_graphs (dict) – dict of document ID keys, each value being a list of nltk.parse.DependencyGraph objects representing a sent

  • list_extraction_patterns (list) – list of parsed open pattern templates from comp_sem_lib.parse_extraction_pattern()

  • dict_collapse_dep_types (dict) – dependency graph types to use when collapsing variable branches = { ‘var_type’ : set([dep_type, dep_type, …]), … }

  • dict_assert_true (dict) – dict of language specific vocabulary for tokens asserting truth used for handling negation. use {} to avoid using a negation vocabulary.

  • dict_assert_false (dict) – dict of language specific vocabulary for tokens asserting falsehood used for handling negation. use {} to avoid using a negation vocabulary.

  • max_processes (int) – number of worker processes to spawn using multiprocessing.Process

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

dict of sent matches = { documentID : [ [ [ ( var_type, var_name, graph_address, collapsed_graph_addresses[], { dep : [ var_name, … ] } ), … x Nvar ], … x Nextract ] … x Nsent ] }

Return type

dict
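
a minimal batch sketch (assuming dict_doc_graphs maps document IDs to lists of nltk.parse.DependencyGraph objects, e.g. from parse_sent_trees_batch(), and list_patterns is as in the previous example):

    # spawn 4 worker processes to match patterns across all documents
    dict_matches = comp_sem_lib.match_extraction_patterns_batch(
        dict_document_sent_graphs = dict_doc_graphs,
        list_extraction_patterns = list_patterns,
        max_processes = 4,
        dict_openie_config = dict_openie_config )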

soton_corenlppy.re.comp_sem_lib.match_extraction_patterns_batch_worker(tuple_queue=None, list_extraction_patterns=[], dict_collapse_dep_types={}, dict_assert_true={'en': ['true', 'genuine', 'real', 'confirmed', 'verified']}, dict_assert_false={'en': ['false', 'fake', 'hoax', 'joke', 'trick', 'debunked']}, pause_on_start=0, process_id=0, dict_openie_config=None)[source]

worker process (spawned via multiprocessing.Process) for comp_sem_lib.match_extraction_patterns_batch()

Parameters
  • tuple_queue (tuple) – tuple of queue (queueIn, queueOut, queueError). queueIn has tuples of ( doc_id, sent_index, serialized nltk.parse.DependencyGraph object ). queueOut has tuples of ( doc_id, sent_index, list_extractions ).

  • list_extraction_patterns (list) – list of parsed open pattern templates from comp_sem_lib.parse_extraction_pattern()

  • dict_collapse_dep_types (dict) – dependency graph types to use when collapsing variable branches = { ‘var_type’ : set([dep_type, dep_type, …]), … }

  • dict_assert_true (dict) – dict of language specific vocabulary for tokens asserting truth used for handling negation. use {} to avoid using a negation vocabulary.

  • dict_assert_false (dict) – dict of language specific vocabulary for tokens asserting falsehood used for handling negation. use {} to avoid using a negation vocabulary.

  • pause_on_start (int) – number of seconds to delay thread startup before CPU intensive work begins (to allow other workers to startup also)

  • process_id (int) – ID of process for logging purposes

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.normalize_open_extraction_templates(list_patterns=None, topN=1000, lang='eng', dict_generalize_strategy={'relax_lex': ['arg', 'rel'], 'relax_pos': [], 'relax_pos_number_aware': ['ctxt']}, dict_openie_config=None)[source]

aggregate and normalize a set of open pattern templates generated by comp_sem_lib.generate_open_extraction_templates()

aggregate and normalize open pattern templates:
  • merge lexical and pos constraints for patterns with the same structure

  • remove all pos and lexical constraints on args based on dict_generalize_strategy

  • remove all lexical constraints on relations

  • TODO use lexicon (e.g. WordNet) to include semantic generalizations (i.e. hypernym) for known lexical constraints

Parameters
  • list_patterns (list) – list of open template extraction strings ready for parsing using comp_sem_lib.parse_extraction_pattern()

  • topN (int) – top N templates to return (-1 for all)

  • lang (str) – WordNet language

  • dict_generalize_strategy (dict) – { ‘relax_lex’ : list_of_var_types, ‘relax_pos’ : list_of_var_types }

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of open template extraction strings ready for parsing using comp_sem_lib.parse_extraction_pattern()

Return type

list

soton_corenlppy.re.comp_sem_lib.parse_allowed_dep_set(allowed_dep_set=None, dict_openie_config=None)[source]

internal function used by generate_open_extraction_templates() and match_extraction_patterns() via collapse_graph_to_make_phrase()

Parameters
  • allowed_dep_set (set) – set of allowed dependency types to include in the collapsed phrase. supports lexical constraints (e.g. ‘case:of’) for finer grained filtering. None to allow any dep.

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of tuple = [ ( dep, wild_card, [ conditionals, … ] ), … ]

Return type

list

soton_corenlppy.re.comp_sem_lib.parse_encoded_extraction(encoded_str=None, dict_openie_config=None)[source]

parse an encoded extraction produced by comp_sem_lib.encode_extraction(). variables are ordered by address

Parameters
  • encoded_str (unicode) – encoded extraction from comp_sem_lib.pretty_print_extraction( style=’encoded’ )

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of extracted variables = [ ( var_name, var_head, var_phrase, { dep_path : [ var,var… ], … }, address, pattern_index, var_phrase_human ), … ]

Return type

list

soton_corenlppy.re.comp_sem_lib.parse_extraction_pattern(str_pattern=None, dict_openie_config=None)[source]

parse an open pattern template serialized as a unicode string into a format easier to process by the function comp_sem_lib.match_extraction_patterns().

the open pattern template represents a directed walk of a dependency parsed sentence graph. the pattern elements are matched left to right. arguments represent noun phrases. relations represent verbs. slots represent context to the relation such as adverbs.

pattern elements:
  • {varN_1:POS_MARK:pos=TAG|TAG|…;lex=TOKEN|TOKEN|…} = variable node (suffix ‘_<seed_token_id>’) that has a POS tag in the set defined (any if set not defined), and token in the set defined (any if set not defined). POS_MARK = S (start), E (end) or - (somewhere in middle of sent). var = {arg, rel, context, prep …}

  • <dep_label< = instruction to permanently move up dependency tree via a specific dependency label (abort if dep_label not found)

  • >dep_label> = instruction to permanently move down dependency tree via a specific dependency label (abort if dep_label not found)

  • -dep_label- = instruction to move to siblings, bind next variable then return to the original position in the graph (abort if dep_label not found)

  • +dep_label+ = instruction to move to children, bind next variable then return to the original position in the graph (abort if dep_label not found)

pattern examples:
  • {arg1_1:} <nsubj< {rel1_1:pos=VBD} >dobj> {arg2_1:}

  • {rel1_1:pos=VBN;lex=announce|choose} <amod< {context0_1:pos=JJ} +case+ {arg1_1:} -appos- {arg2_1:}

Parameters
  • str_pattern (unicode) – serialized open pattern template

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of tuples, each a pattern instruction e.g. [ ( ‘rel’, ‘rel1_1’, ‘-’, ( (‘vbz,vbn’),(‘contains’,’holds’) ) ), ( ‘dep_child’, ‘aux’ ), … ]

Return type

list
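
a minimal round-trip sketch using the first pattern example above (serialize_extraction_pattern() is documented later in this module):

    import soton_corenlppy.re.comp_sem_lib as comp_sem_lib

    # parse a serialized open pattern template into a list of instruction tuples
    list_pattern = comp_sem_lib.parse_extraction_pattern(
        str_pattern = u'{arg1_1:} <nsubj< {rel1_1:pos=VBD} >dobj> {arg2_1:}',
        dict_openie_config = dict_openie_config )

    # serialize the parsed pattern back to a unicode string
    str_round_trip = comp_sem_lib.serialize_extraction_pattern(
        list_pattern = list_pattern,
        dict_openie_config = dict_openie_config )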

soton_corenlppy.re.comp_sem_lib.parse_sent_trees(list_sent_trees=None, dep_parser=None, dict_custom_pos_mappings={}, space_replacement_char='_', dict_openie_config=None)[source]

parse a list of sentence trees from soton_corenlppy.common_parse_lib.create_sent_trees() and return a list of dep graph objects

Parameters
  • list_sent_trees (list) – list of stanford POS tagged sent trees

  • dep_parser (list) – list of commands for popen() from get_dependency_parser()

  • dict_custom_pos_mappings (dict) – dict of custom POS mappings e.g. { ‘FIGURE’ : ‘CD’, ‘TABLE’ : ‘CD’, … }

  • space_replacement_char (str) – replacement char for all token spaces

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of nltk.parse.DependencyGraph

Return type

list
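
a minimal sketch (assuming list_sent_trees comes from soton_corenlppy.common_parse_lib.create_sent_trees() and dict_openie_config was created earlier):

    import soton_corenlppy.re.comp_sem_lib as comp_sem_lib

    # build the Stanford dep parser command line (tagged text supplied via STDIN)
    list_parser_cmd = comp_sem_lib.get_dependency_parser(
        dict_openie_config = dict_openie_config )

    # dependency parse the POS tagged sent trees into nltk.parse.DependencyGraph objects
    list_dep_graphs = comp_sem_lib.parse_sent_trees(
        list_sent_trees = list_sent_trees,
        dep_parser = list_parser_cmd,
        dict_openie_config = dict_openie_config )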

soton_corenlppy.re.comp_sem_lib.parse_sent_trees_batch(dict_doc_sent_trees=None, dep_parser=None, dict_custom_pos_mappings={}, space_replacement_char='_', max_processes=4, dict_openie_config=None)[source]

dependency parse a batch of documents, each with a list of Stanford POS tagged sents from soton_corenlppy.common_parse_lib.create_sent_trees(). multiprocess spawning is used to maximize CPU usage, as dependency parsing is slow and CPU intensive.

Parameters
  • dict_doc_sent_trees (dict) – dict of documents { docID : list of sent trees }

  • dep_parser (list) – list of commands for popen() from get_dependency_parser()

  • dict_custom_pos_mappings (dict) – dict of custom POS mappings e.g. { ‘FIGURE’ : ‘CD’, ‘TABLE’ : ‘CD’, … }

  • space_replacement_char (str) – replacement char for all token spaces

  • max_processes (int) – number of worker processes to spawn using multiprocessing.Process

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

dict of documents { docID : list of nltk.parse.DependencyGraph }

Return type

dict
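
a minimal batch sketch (assuming dict_doc_trees maps document IDs to lists of POS tagged sent trees, and list_parser_cmd is as in the previous example):

    # dependency parse all documents in parallel using 4 worker processes
    dict_doc_graphs = comp_sem_lib.parse_sent_trees_batch(
        dict_doc_sent_trees = dict_doc_trees,
        dep_parser = list_parser_cmd,
        max_processes = 4,
        dict_openie_config = dict_openie_config )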

soton_corenlppy.re.comp_sem_lib.parse_sent_trees_worker(tuple_queue=None, dep_parser=None, dict_custom_pos_mappings={}, space_replacement_char='_', pause_on_start=0, process_id=0, dict_openie_config=None)[source]

worker process (spawned via multiprocessing.Process) for comp_sem_lib.parse_sent_trees_batch()

Parameters
  • tuple_queue (tuple) – tuple of queue (queueIn, queueOut, queueError). the queueIn has ( docID, list_sent_trees ). queueOut has ( docID, list_serialized_graphs )

  • dep_parser (list) – list of commands for popen() from get_dependency_parser()

  • dict_custom_pos_mappings (dict) – dict of custom POS mappings e.g. { ‘FIGURE’ : ‘CD’, ‘TABLE’ : ‘CD’, … }

  • space_replacement_char (str) – replacement char for all token spaces

  • pause_on_start (int) – number of seconds to pause to allow other workers to start

  • process_id (int) – process ID

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.pattern_match_recurse_into_graph(dep_graph=None, graph_address=0, force_next_address=None, pattern_spec=[], pattern_pos=0, pattern_result=None, pattern_success=None, dict_openie_config=None)[source]

internal function called by comp_sem_lib.match_extraction_patterns(). nodes in the dependency graph are tried using a breadth first search strategy. for each search node the open pattern template is applied, recursing until either failure or success (end of template). the reference parameter pattern_result contains the variables matched at whatever level of recursion the algorithm has reached. memory note - a copy of the variables found is kept at each search node, so if the breadth first search has a lot of combinations the memory footprint will get large.

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • graph_address (int) – address is the token index in the dependency graph

  • force_next_address (int) – address to force any recursive call (allows a temp node to be explored and then switch back to a previous position in graph)

  • pattern_spec (list) – parsed open pattern template from comp_sem_lib.parse_extraction_pattern()

  • pattern_pos (int) – current position in parsed open pattern template

  • pattern_result (list) – current partially populated patterns = [(var_type, var_name, graph_address), … ]

  • pattern_success (list) – successful fully populated patterns = [(var_type, var_name, graph_address), … ]

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.prepare_tags_for_dependency_parse(list_tagged_sents=None, dict_custom_pos_mappings={}, space_replacement_char='_', dict_openie_config=None)[source]

prepare a list of tagged sents for dependency parsing. the stanford dependency parser gets confused if there are spaces in tokens (e.g. phrases) and does not understand custom POS tags. the following processing is applied to list_tagged_sents:

  • replace all token spaces with a replacement char

  • replace all custom POS tags (e.g. CITATION) with a replacement the tagger understands (e.g. PENN tags)

Parameters
  • list_tagged_sents (list) – reference argument providing a list of tagged sents that will be modified directly i.e. [ [ (token, pos), (token, pos), … ], … ]

  • dict_custom_pos_mappings (dict) – dict of custom POS mappings e.g. { ‘FIGURE’ : ‘CD’, ‘TABLE’ : ‘CD’, … }

  • space_replacement_char (str) – replacement char for all token spaces

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()
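
a minimal sketch (note that list_tagged_sents is modified in place; the tokens and mapping values are illustrative):

    import soton_corenlppy.re.comp_sem_lib as comp_sem_lib

    # tagged sents containing a multi-token phrase and a custom POS tag
    list_tagged_sents = [ [ ( u'New York', u'NNP' ), ( u'fig. 1', u'FIGURE' ) ] ]

    # replaces token spaces with '_' and maps FIGURE -> CD, directly in the list
    comp_sem_lib.prepare_tags_for_dependency_parse(
        list_tagged_sents = list_tagged_sents,
        dict_custom_pos_mappings = { u'FIGURE' : u'CD' },
        space_replacement_char = '_',
        dict_openie_config = dict_openie_config )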

soton_corenlppy.re.comp_sem_lib.pretty_print_extraction(list_extracted_vars=None, dep_graph=None, set_var_types={}, style='highlighted_vars', space_replacement_char='_', dict_openie_config=None)[source]

pretty print a set of extracted variables from comp_sem_lib.match_extraction_patterns(). arguments are sorted in lexical order for easier reading.

Parameters
  • list_extracted_vars (list) – list of variables from comp_sem_lib.match_extraction_patterns()

  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • set_var_types (set) – set of {var} types for pretty print

  • style (str) – style for print = highlighted_vars, plain_vars, tokens_only

  • space_replacement_char (str) – replacement char for all token spaces. should be same as prepare_tags_for_dependency_parse()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

pretty print version of arguments

Return type

unicode
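
a minimal sketch of pretty printing each match from match_extraction_patterns() (the set_var_types values are illustrative):

    # print each extraction with its matched variables highlighted
    for list_vars in list_matches :
        str_pretty = comp_sem_lib.pretty_print_extraction(
            list_extracted_vars = list_vars,
            dep_graph = dep_graph,
            set_var_types = { 'arg', 'rel' },
            style = 'highlighted_vars',
            dict_openie_config = dict_openie_config )
        print( str_pretty )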

soton_corenlppy.re.comp_sem_lib.pretty_print_extraction_var(list_extracted_vars=None, dep_graph=None, var_name=None, dict_pretty_dep_rels={'after_head': ['case', 'case:of', 'case:by'], 'any': ['compound', 'amod', 'nummod', 'advmod', 'cop', 'appos', 'dep', 'conj', 'nmod', 'xcomp'], 'before_head': []}, space_replacement_char='_', dict_openie_config=None)[source]

pretty print a specific variable in a specific extraction from comp_sem_lib.match_extraction_patterns(). pretty printed text appears in lexical order for easier reading.

Parameters
  • list_extracted_vars (list) – list of variables from comp_sem_lib.match_extraction_patterns()

  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • var_name (str) – name of variable

  • dict_pretty_dep_rels (dict) – dict of dep rels to allow in pretty print based on address position relative to head { ‘any’ : [], ‘before_head’ : [],’after_head’ : [] }. None allows any dep

  • space_replacement_char (str) – replacement char for all token spaces. should be same as prepare_tags_for_dependency_parse()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

tuple = ( pretty_print_variable, head_token, (negated, genuine), { dep_path : [ var,var… ], … } )

Return type

tuple

soton_corenlppy.re.comp_sem_lib.read_pipe_stderr(pipe_handle, queue_buffer)[source]

internal DEP PARSE process pipe callback function

Parameters
  • pipe_handle (file) – pipe handle to DEP PARSE errors

  • queue_buffer (Queue.Queue()) – queue where pipe errors can be stored

soton_corenlppy.re.comp_sem_lib.read_pipe_stdout(pipe_handle, queue_buffer)[source]

internal DEP PARSE process pipe callback function

Parameters
  • pipe_handle (file) – pipe handle to DEP PARSE output

  • queue_buffer (Queue.Queue()) – queue where pipe output can be stored

soton_corenlppy.re.comp_sem_lib.serialize_b4_dep_graph_as_conll2007(root_dep_node=None, root_tokens_node=None, dict_openie_config=None)[source]

take stanford XML parsed bs4 soup nodes and return a serialized CoNLL 2007 formatted dependency graph

Parameters
  • root_dep_node (bs4.element.Tag) – bs4 root node for basic-dependencies list of dep types

  • root_tokens_node (bs4.element.Tag) – bs4 root node for token list

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

serialized CoNLL 2007 formatted dep graph, with row columns = {sent_index} {word} {lemma} {ctag} {tag} _ {head} {rel} _ _

Return type

unicode

soton_corenlppy.re.comp_sem_lib.serialize_dependency_graph(dep_graph, dict_openie_config)[source]

safely serialize an NLTK dependency graph using to_conll( style=4 ), including the special head ‘_’ token for blank nodes.

the dep_parser.tagged_parse_sents() function will return a Stanford Parser dep graph that has { ‘address’ : None } values for nodes such as punctuation (e.g. ‘,’). this causes errors when re-parsing a serialized graph, as the NLTK parser does not allow a CoNLL head value of ‘None’; instead it needs the special ‘_’ value which the Stanford Parser produces.

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – dependency graph to serialize

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

serialized graph that can be read back in using nltk.parse.DependencyGraph( tree_str = str_serialized_graph, top_relation_label = ‘root’ )

Return type

unicode
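
a minimal round-trip sketch, using the read-back call documented above:

    import nltk.parse
    import soton_corenlppy.re.comp_sem_lib as comp_sem_lib

    # serialize safely, so blank node heads become '_' rather than None
    str_serialized_graph = comp_sem_lib.serialize_dependency_graph( dep_graph, dict_openie_config )

    # read the serialized graph back in
    dep_graph_copy = nltk.parse.DependencyGraph(
        tree_str = str_serialized_graph,
        top_relation_label = 'root' )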

soton_corenlppy.re.comp_sem_lib.serialize_extraction_pattern(list_pattern=None, dict_openie_config=None)[source]

serialize an open pattern template. see comp_sem_lib.parse_extraction_pattern()

Parameters
  • list_pattern (list) – parsed pattern from comp_sem_lib.parse_extraction_pattern()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

serialized pattern

Return type

unicode

soton_corenlppy.re.comp_sem_lib.unescape_extraction_pattern(token=None)[source]

unescape extraction pattern tokens. see comp_sem_lib.escape_extraction_pattern()

Parameters

token (unicode) – token to unescape

Returns

unescaped token

Return type

unicode