soton_corenlppy.re.comp_sem_lib module

Compositional semantics library

soton_corenlppy.re.comp_sem_lib.annotate_phrase_sequences(list_sent_tree=None, dict_inverted_index_pos_phrase_patterns=None, dict_openie_config=None)[source]

calc n-gram phrases from sequences of tokens with the same POS tag

Parameters
  • list_sent_tree (list) – list of nltk.Tree representing the sents in a doc = [ nltk.Tree( ‘(S (IN For) (NN handle) … (REF (NP Agora) (NP XXIII) (, ,) (DOC_SECTION pl. 44) (DOC_SECTION no. 448)) …)’ ), … ]

  • dict_inverted_index_pos_phrase_patterns (dict) – inverted index from soton_corenlppy.re.openie_lib.calc_inverted_index_pos_phrase_patterns()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of nltk.Tree representing the sents in that doc after phrase sequencing

Return type

list
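
a minimal usage sketch (variable names are hypothetical; it is assumed here that get_openie_config() and calc_inverted_index_pos_phrase_patterns() can be called with only the arguments shown):

   import soton_corenlppy

   # shared config object used by all comp_sem_lib functions
   dict_openie_config = soton_corenlppy.re.openie_lib.get_openie_config()

   # inverted index of POS phrase patterns (assumed call signature)
   dict_index = soton_corenlppy.re.openie_lib.calc_inverted_index_pos_phrase_patterns(
       dict_openie_config=dict_openie_config )

   # list_sent_tree = list of nltk.Tree sents for one doc, prepared earlier
   list_phrased = soton_corenlppy.re.comp_sem_lib.annotate_phrase_sequences(
       list_sent_tree=list_sent_tree,
       dict_inverted_index_pos_phrase_patterns=dict_index,
       dict_openie_config=dict_openie_config )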

soton_corenlppy.re.comp_sem_lib.annotate_using_pos_patterns(list_sent_trees=None, list_phrase_sequence_patterns_exec_order=['ENTITY', 'ENTITY_LIST', 'ENTITY_AND_CONTEXT', 'RELATION'], dict_phrase_sequence_patterns={'ENTITY': [...], 'ENTITY_AND_CONTEXT': [...], 'ENTITY_LIST': [...], 'RELATION': [...]}, dict_openie_config=None)[source]

Apply a set of POS patterns to a set of tagged sent trees (e.g. for ReVerb arguments and relations, or CH entities, lists of entities, attributes and relations). The result is a sent with POS pattern annotations represented as nltk.Tree elements.

Parameters
  • list_sent_trees (list) – list of nltk.Tree representing the sents in a doc

  • list_phrase_sequence_patterns_exec_order (list) – order that phrase sequence patterns should be executed (usually most permissive last). see openie.list_phrase_sequence_patterns_exec_order_default

  • dict_phrase_sequence_patterns (dict) – phrase sequence patterns of compiled regex extracted groups with same name as pattern = { pattern : [ regex, regex, … ] }. see openie.dict_phrase_sequence_patterns_default for example

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of documents, each containing a list of nltk.Tree objects representing an extracted lexico-pos pattern for entity, attribute and relation phrases

Return type

list
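
a sketch with a single custom pattern set, reusing the first ENTITY regex from the default set; note the named group must match the pattern key:

   import re
   import soton_corenlppy

   # one-pattern set; the full defaults are in openie.dict_phrase_sequence_patterns_default
   dict_patterns = {
       'ENTITY': [
           re.compile(
               '\\A.*?(?P<ENTITY>\\((PROPER_NOUN_P|NOUN_P) [^)]*\\)'
               '( \\((PROPER_NOUN_P|NOUN_P) [^)]*\\)){0,5})',
               re.DOTALL ),
       ],
   }

   list_annotated = soton_corenlppy.re.comp_sem_lib.annotate_using_pos_patterns(
       list_sent_trees=list_phrased,
       list_phrase_sequence_patterns_exec_order=['ENTITY'],
       dict_phrase_sequence_patterns=dict_patterns,
       dict_openie_config=dict_openie_config )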

soton_corenlppy.re.comp_sem_lib.blank_serialized_tree(tree_sent=None, dict_openie_config=None)[source]

create a version of the serialized tree that blanks out all deep structures, just for matching. this allows regex matches to be a lot simpler and not worry about n-deep () structures. the size is identical to the original sent so we can use the match character positions later e.g. original = (S (ATTRIBUTE (NOUN_P handle plates) (PREPOSITION_P of) (NOUN_P column kraters)) (VERB_P attributed) …) blanked = (S (ATTRIBUTE ) (VERB_P ) …)

Parameters
  • tree_sent (nltk.Tree) – nltk.Tree representing a sent = nltk.Tree( ‘(S (IN For) (NN handle) … (REF (NP Agora) (NP XXIII) (, ,) (DOC_SECTION pl. 44) (DOC_SECTION no. 448)) …)’ )

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

serialized tree blanked

Return type

unicode
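
a sketch using the example above (assumes dict_openie_config was created earlier):

   import nltk
   import soton_corenlppy

   tree_sent = nltk.Tree.fromstring(
       '(S (ATTRIBUTE (NOUN_P handle plates) (PREPOSITION_P of) (NOUN_P column kraters)) (VERB_P attributed))' )
   str_blanked = soton_corenlppy.re.comp_sem_lib.blank_serialized_tree(
       tree_sent=tree_sent,
       dict_openie_config=dict_openie_config )
   # str_blanked keeps the top-level labels but empties the nested structure,
   # e.g. '(S (ATTRIBUTE ) (VERB_P ))' padded to the original serialized length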

soton_corenlppy.re.comp_sem_lib.calc_graph_paths_connecting_targets(list_targets=None, dep_graph=None, start_address=0, list_shortest_path=[], longest_dep_path=32, longest_inter_target_walk=2, avoid_dep_set={}, node_branch_index=None, dict_openie_config=None)[source]

calculate a graph walk path that connects all targets. this algorithm uses a fast depth first exploration of dep branches, as opposed to a slower random walk approach. the order of targets found is not important as we want to allow all sequences of targets (e.g. arg rel arg AND rel arg arg ).

Parameters
  • list_targets (list) – list of tuples (address, var_name, var_type) to find on graph walk in sequential order they must be found

  • dep_graph (nltk.parse.DependencyGraph) – dependency graph to walk

  • start_address (int) – root address in graph branch to walk

  • list_shortest_path (list) – shortest path so far

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • node_branch_index (dict) – index of branch addresses under each node (to make walk more efficient by avoiding branches without the target node)

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

shortest path connecting all targets = [ (address, var_name, var_type), … ]

Return type

list
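
a sketch of a target walk (addresses and names are illustrative; node_branch_index comes from construct_node_index(), documented below):

   import soton_corenlppy

   # targets as (address, var_name, var_type) tuples to connect on the graph walk
   list_targets = [ (2, 'arg1', 'arg'), (4, 'rel1', 'rel'), (7, 'arg2', 'arg') ]

   node_index = soton_corenlppy.re.comp_sem_lib.construct_node_index(
       dep_graph=dep_graph,
       dict_openie_config=dict_openie_config )

   list_path = soton_corenlppy.re.comp_sem_lib.calc_graph_paths_connecting_targets(
       list_targets=list_targets,
       dep_graph=dep_graph,
       start_address=0,
       list_shortest_path=[],
       node_branch_index=node_index,
       dict_openie_config=dict_openie_config )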

soton_corenlppy.re.comp_sem_lib.check_for_negation(dep_graph=None, node_address=None, lang='en', dict_assert_true={'en': ['true', 'genuine', 'real', 'confirmed', 'verified']}, dict_assert_false={'en': ['false', 'fake', 'hoax', 'joke', 'trick', 'debunked']}, dict_openie_config=None)[source]

check dependency graph for evidence of negation of a node, or genuine/fake claims with this node as the subject. negation strategy:

  • check neg dep

  • check amod dep leading to a true|false assertion (and then check for a neg dep)

  • check head for [nsubj,nsubjpass] dep originating from a true|false assertion (and then check head for a neg dep)

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • node_address (int) – address is the token index in the dependency graph

  • lang (str) – ISO 639-1 two-character language code e.g. ‘en’, ‘fr’

  • dict_assert_true (dict) – dict of language specific vocabulary for tokens asserting truth. use {} to avoid using a negation vocabulary.

  • dict_assert_false (dict) – dict of language specific vocabulary for tokens asserting falsehood. use {} to avoid using a negation vocabulary.

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

negation status = (negated, genuine) = (true|None, true|false|None)

Return type

tuple
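
a sketch checking one token for negation (the node address is illustrative; dep_graph is a parsed sent from earlier):

   import soton_corenlppy

   (negated, genuine) = soton_corenlppy.re.comp_sem_lib.check_for_negation(
       dep_graph=dep_graph,
       node_address=3,
       lang='en',
       dict_openie_config=dict_openie_config )
   if negated == True:
       print( 'claim is negated' )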

soton_corenlppy.re.comp_sem_lib.collapse_graph_to_make_phrase(dep_graph=None, node_address=None, allowed_dep_set=None, forbidden_address_set={}, variable_head_range=None, search_depth=0, dict_openie_config=None)[source]

collapse a dependency graph and generate a set of text tokens representing a phrase for this node. ensure all tokens appear sequentially around the root node address (e.g. ‘the only other plan’ -> ‘the other plan’ is not allowed, ‘only other plan’ is allowed)

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • node_address (int) – address is the token index in the dependency graph

  • allowed_dep_set (set) – set of allowed dependency types to include in the collapsed phrase. supports lexical constraints (e.g. ‘case:of’) for finer grained filtering. None to allow any dep.

  • forbidden_address_set (set) – set of forbidden dependency graph addresses to avoid in the collapsed phrase

  • variable_head_range (tuple) – tuple (first var head addr, last var head addr) for use with metadata commands (‘not_before_first_var’, ‘not_after_last_var’)

  • search_depth (int) – internal argument, recursion depth

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

success or failure of pattern matching. the reference argument pattern_result will contain the variables matched if this is successful.

Return type

bool
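
a sketch collapsing a noun phrase around one node; the dep type set is illustrative and includes a lexical constraint of the ‘case:of’ form described above:

   import soton_corenlppy

   set_allowed = set( ['amod', 'compound', 'det', 'case:of'] )
   result = soton_corenlppy.re.comp_sem_lib.collapse_graph_to_make_phrase(
       dep_graph=dep_graph,
       node_address=5,
       allowed_dep_set=set_allowed,
       forbidden_address_set=set(),
       dict_openie_config=dict_openie_config )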

soton_corenlppy.re.comp_sem_lib.construct_node_index(dep_graph=None, dict_openie_config=None)[source]

compute a complete list of the addresses under each node. this is important to guide the graph walk later. internal method called by generate_open_extraction_templates()

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – nltk.parse.DependencyGraph object

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

index of each address containing node addresses under it in tree { addr : [ childaddr1, childaddr2 … ] }

Return type

dict

soton_corenlppy.re.comp_sem_lib.construct_seed_addr_options(seed_options=None, seed_graph_address_output=None, dict_openie_config=None)[source]

calculate seed address walks given a set of seed address options in a nested structure. internal method called by generate_open_extraction_templates()

Parameters
  • seed_options (dict) – dict of seed options

  • seed_graph_address_output (list) – output list which will be populated with seed walk options

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.encode_extraction(list_extracted_vars=None, dep_graph=None, set_var_types={}, dict_pretty_dep_rels={'after_head': ['case', 'case:of', 'case:by'], 'any': ['compound', 'amod', 'nummod', 'advmod', 'cop', 'appos', 'dep', 'conj', 'nmod', 'xcomp'], 'before_head': []}, space_replacement_char='_', dict_openie_config=None)[source]

encode an extraction in a serialize format that can be parsed using comp_sem_lib.parse_encoded_extraction()

Parameters
  • list_extracted_vars (list) – list of variables from comp_sem_lib.match_extraction_patterns()

  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • set_var_types (set) – set of {var} types for pretty print

  • dict_pretty_dep_rels (dict) – dict of dep rels to allow in pretty print based on address position relative to head { ‘any’ : [], ‘before_head’ : [], ‘after_head’ : [] }

  • space_replacement_char (str) – replacement char for all token spaces

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

encoded extraction

Return type

unicode
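
a sketch encoding a single match (list_matches is assumed to come from a match_extraction_patterns() call, documented later in this module):

   import soton_corenlppy

   str_encoded = soton_corenlppy.re.comp_sem_lib.encode_extraction(
       list_extracted_vars=list_matches[0],
       dep_graph=dep_graph,
       set_var_types=set( ['arg', 'rel'] ),
       dict_openie_config=dict_openie_config )
   # the encoded string can later be parsed with comp_sem_lib.parse_encoded_extraction()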

soton_corenlppy.re.comp_sem_lib.escape_extraction_pattern(token=None)[source]

escape extraction pattern tokens, replacing | with -ESC_PIPE-, : with -ESC_COLON-, { and } with -ESC_LCB- and -ESC_RCB-, and [ and ] with -ESC_LSB- and -ESC_RSB-. the escape replacement labels are deliberately different from Stanford escaping to avoid conflicts

Parameters

token (unicode) – token to escape

Returns

escaped token

Return type

unicode
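
for example, applying the replacements listed above:

   import soton_corenlppy

   str_esc = soton_corenlppy.re.comp_sem_lib.escape_extraction_pattern( token='nmod:of|{x}' )
   # expected: 'nmod-ESC_COLON-of-ESC_PIPE--ESC_LCB-x-ESC_RCB-'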

soton_corenlppy.re.comp_sem_lib.exec_dep_parser(list_tagged_sents=None, dep_parser_cmd=None, dict_openie_config=None, timeout=300, sigterm_handler=False)[source]

exec a java command line (created from get_dependency_parser()) using popen to run the Stanford dependency parser. pipes are used to avoid any need for file IO

Parameters
  • list_tagged_sents (list) – list of tagged sents from soton_corenlppy.common_parse_lib.pos_tag_tokenset()

  • dep_parser_cmd (list) – list of commands for popen() from get_dependency_parser()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

  • timeout (int) – timeout in seconds for the dep parser process, in the unlikely event the dependency parser hangs

  • sigterm_handler (bool) – if True SIGTERM will be setup to terminate the process handle before exit

Returns

list of nltk.parse.DependencyGraph objects (one per sent)

Return type

list
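
a sketch of the tagged sents to dependency graph pipeline (list_tagged_sents is assumed to come from soton_corenlppy.common_parse_lib.pos_tag_tokenset()):

   import soton_corenlppy

   dep_parser_cmd = soton_corenlppy.re.comp_sem_lib.get_dependency_parser(
       dict_openie_config=dict_openie_config )
   list_dep_graphs = soton_corenlppy.re.comp_sem_lib.exec_dep_parser(
       list_tagged_sents=list_tagged_sents,
       dep_parser_cmd=dep_parser_cmd,
       dict_openie_config=dict_openie_config,
       timeout=300 )
   # one nltk.parse.DependencyGraph per sent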

soton_corenlppy.re.comp_sem_lib.exec_stanford_corenlp(dict_text=None, work_dir=None, annotators='tokenize,ssplit,pos,depparse,lemma,ner', option_list=['-tokenize.options', 'asciiQuotes=true,americanize=false', '-ssplit.eolonly', 'true'], num_processes=6, dict_openie_config=None)[source]

run Stanford CoreNLP to do one or more of the following:
  • Tokenization

  • POS tagging

  • Dependency parsing

  • NER

Parameters
  • dict_text (dict) – dict of text to process, keyed by sent index (the returned dicts use the same sent_index keys)

  • work_dir (str) – working directory CoreNLP can use for temporary files

  • annotators (str) – comma-separated list of CoreNLP annotators to run

  • option_list (list) – list of additional CoreNLP command line options

  • num_processes (int) – number of parallel CoreNLP processes to use

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

tuple of requested information e.g. ( dict_tokens, dict_pos, dict_dep_graph, dict_ner ). all dict use sent_index as the key.

Return type

tuple
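
a sketch (the structure of dict_text is not documented here; it is assumed to map sent indexes to text, matching the sent_index keys of the returned dicts, and the work_dir path is illustrative):

   import soton_corenlppy

   dict_text = { 0: 'the handle is missing .', 1: 'it was attributed to the potter .' }
   ( dict_tokens, dict_pos, dict_dep_graph, dict_ner ) = \
       soton_corenlppy.re.comp_sem_lib.exec_stanford_corenlp(
           dict_text=dict_text,
           work_dir='/tmp/corenlp_work',
           annotators='tokenize,ssplit,pos,depparse,lemma,ner',
           dict_openie_config=dict_openie_config )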

soton_corenlppy.re.comp_sem_lib.extract_annotations_from_sents(list_sent_trees=None, set_annotations={}, pos_tags=False, include_start_and_end_annotations=False, dict_openie_config=None)[source]

extract a set of annotations from a list of sent trees. for example extracting argument and relation annotations from ReVerb style annotated sent trees.

Parameters
  • list_sent_trees (list) – list of nltk.Tree representing the sents in that doc = [ nltk.Tree( ‘(S (IN For) (NN handle) … (REF (NP Agora) (NP XXIII) (, ,) (DOC_SECTION pl. 44) (DOC_SECTION no. 448)) …)’ ), … ]

  • set_annotations (set) – set of allowed annotation labels e.g. set( [‘RELATION’,’ARGUMENT’] )

  • pos_tags (bool) – if True use nltk.Tree.pos() to return tuples (‘token’,’pos’) not just strings ‘token’ after the node label

  • include_start_and_end_annotations (bool) – if True add START and END annotations if the sent start and end immediately precedes or follows a matching annotation

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of sent extractions, each a list of annotation tuples (type, token, …) extracted in the order they appear in the sent e.g. [ [ (‘ARGUMENT’,’John’), (‘RELATION’,’come’,’from’), (‘ARGUMENT’,’Paris’) ], … ]

Return type

list
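
a sketch extracting ReVerb style annotations from annotated sent trees:

   import soton_corenlppy

   list_extractions = soton_corenlppy.re.comp_sem_lib.extract_annotations_from_sents(
       list_sent_trees=list_sent_trees,
       set_annotations=set( ['RELATION', 'ARGUMENT'] ),
       pos_tags=False,
       dict_openie_config=dict_openie_config )
   # e.g. [ [ ('ARGUMENT', 'John'), ('RELATION', 'come', 'from'), ('ARGUMENT', 'Paris') ], ... ]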

soton_corenlppy.re.comp_sem_lib.filter_extractions(dep_graph=None, list_extractions=[], filter_strategy='min_semantic_drift_per_target', use_context=False, max_context=4, min_var_connection=2, max_semantic_drift=4, target_var_type=None, dict_sem_drift={'appos': 2, 'cc': 2, 'conj': 2, 'dep': 0, 'dislocated': 2, 'list': 2, 'parataxis': 2, 'punct': 2, 'remnant': 2}, dict_openie_config=None)[source]

filter a set of extractions for a sent, to get a more focussed set without redundant and overlapping extractions.

filtering strategy - segment_coordinating_conj:
  • idea - ensures extractions cover the parts of a proposition, avoiding over-large propositions and making it easier to extract individual entity attributes (good for knowledge-base production methods)

  • segment sent address range by coordinating conjunction [CC , :] = (addr_start, addr_end)

  • for each address segment find smallest extractions that fully span this address range, favouring highest variable count if multiple options exist

  • if none exist get extractions that cover as many of the individual addresses as possible, favouring extractions whose size best fits the segment size

filtering strategy - segment_subsumption:
  • idea - ensures extractions contain full context, providing large propositions that are easy for humans to understand (good for human intelligence gathering)

  • select extractions whose tokens fully subsume another extraction, favouring highest variable count if multiple options exist

  • head address only used for subsumption checks

filtering strategy - min_semantic_drift_per_target:
  • idea - ensures there is an extraction for each instance of a target var type (e.g. verb-mediated rel in sent)

  • for each target var instance select extractions containing it that have the lowest max inter-var path between variables (not context) and the target var. a preference for widest address range is used to differentiate between extractions with same max inter-var path.

filtering strategy - threshold_semantic_drift_per_target:
  • idea - ensures there is an extraction for each instance of a target var type (e.g. verb-mediated rel in sent)

  • for each target var instance select extractions containing it that have a <= threshold inter-var path between variables (not context) and the target var.

confidence value:
  • the confidence value is appended to the end of each extraction (higher numbers are better)

  • confidence = number of extractions allowed / total number of extractions

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • list_extractions (list) – list of extractions from comp_sem_lib.match_extraction_patterns()

  • filter_strategy (unicode) – name of filter strategy to apply. Can also be None for no filtering.

  • use_context (bool) – use context variables as targets when applying strategy (default is False so only non context variables are considered for intervar distances and address ranges)

  • max_context (int) – max number of context variables to allow per extraction (can be None)

  • min_var_connection (int) – min number of variables connected to target for extraction to be considered (filtered prior to looking at semantic drift) (can be None)

  • max_semantic_drift (int) – max semantic drift between variables (can be None)

  • target_var_type (unicode) – name of target var type (for min_semantic_drift_per_target strategy)

  • dict_sem_drift (dict) – dict of semantic drift costs for dep types e.g. { ‘conj’ : 2 }

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

tuple = (list_extractions_filtered, list_conf). list_extractions_filtered is a filtered list of extractions using same format as comp_sem_lib.match_extraction_patterns(). list_conf is a list of float confidence values per extraction, high value = good

Return type

tuple
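
a sketch using the min_semantic_drift_per_target strategy; the target var type name ‘rel’ is an assumption following the seed to template mapping examples elsewhere in this module:

   import soton_corenlppy

   ( list_filtered, list_conf ) = soton_corenlppy.re.comp_sem_lib.filter_extractions(
       dep_graph=dep_graph,
       list_extractions=list_matches,
       filter_strategy='min_semantic_drift_per_target',
       target_var_type='rel',
       dict_openie_config=dict_openie_config )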

soton_corenlppy.re.comp_sem_lib.filter_open_extraction_templates_using_relevance_feedback(list_parsed_patterns=None, list_doc_set_of_propositions=None, list_relevance_feedback=None, dict_openie_config=None)[source]

filter a set of parsed open pattern templates generated by comp_sem_lib.generate_open_extraction_templates() using relevance feedback. relevance feedback is provided in the form of a list of scored extractions. any template which creates an incorrect extraction will be removed from the provided list.

Parameters
  • list_parsed_patterns (list) – list of parsed open template extraction patterns generated by comp_sem_lib.parse_extraction_pattern()

  • list_doc_set_of_propositions (list) – list of tuples = ( str_index_doc, list_phrases_prop, parsed_pattern_index, list_prop_pattern, list_head_text ). parsed_pattern_index is index of pattern within list_parsed_patterns.

  • list_relevance_feedback (list) – list of tuples = ( str_index_doc, list_phrases_prop, str_score ). a score of 0 is incorrect, a score of 1 is correct.

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.filter_proposition_set(list_proposition_set=None, list_proposition_set_conf=None, target_index=1, filter_strategy='prop_subsumption', dict_index_stoplist_prefix={0: ['and ', 'or ', 'of ', 'to ', 'in ', 'into ', 'for ', 'at ', "'s ", 'by ', 'on ', 'him '], 1: ['and ', 'or '], 2: ['and ', 'or ']}, dict_index_stoplist_suffix={0: [' and', ' or'], 1: [' and', ' or'], 2: [' and', ' or', ' those', ' he', ' she', ' they', ' the', ' a', ' would']}, lex_phrase_index=None, lex_uri_index=None, dict_openie_config=None)[source]

filter a proposition set, selecting the best target. filtered entries are deleted from list_proposition_set

filter strategy:
  • min_length - group propositions at target_index which share a common head address. sort this group by address length (target_index first, then other indexes). select top of list (min length) as single option.

  • max_length - group propositions at target_index which share a common head address. sort this group by address length (target_index first, then other indexes). select bottom of list (max length) as single option.

  • prop_subsumption - remove any n-gram proposition that is subsumed by a higher gram proposition.

  • lexicon_filter - remove any n-gram proposition which does not have at least 1 variable with a lexicon phrase match (unigrams, bigrams and trigrams are checked with morphy).

Parameters
  • list_proposition_set (list) – list of tuples obtained from calls to comp_sem_lib.generate_proposition_set_from_extraction(). filtered entries will be removed from this set.

  • list_proposition_set_conf (list) – list of confidence values associated with each proposition. filtered entries will be removed from this set.

  • target_index (int) – proposition index of target variable to base filtering on (e.g. index of rel)

  • filter_strategy (str) – filter strategy = min_length|max_length|prop_subsumption|lexicon_filter

  • dict_index_stoplist_prefix (dict) – dict of prefixes for each index to disallow e.g. ‘of ‘ on first argument of {arg,rel,arg}

  • dict_index_stoplist_suffix (dict) – dict of suffixes for each index to disallow e.g. ‘ and’ on rel argument of {arg,rel,arg}

  • lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.generate_open_extraction_templates(seed_tuples=None, var_candidates=None, corpus_sent_graphs=None, dict_seed_to_template_mappings={}, dict_context_dep_types=[], longest_dep_path=32, longest_inter_target_walk=2, max_seed_variants=128, allow_seed_subsumption=True, avoid_dep_set={}, space_replacement_char='_', dict_openie_config=None)[source]

generate a set of open pattern templates based on a training corpus (dependency parsed into graphs) and seed_tuples with known ‘high quality’ argument and relation groups.

for each seed_tuple generate specific open pattern templates:
  • for each sent, get all possible combinations of seed tokens where the seed tokens appear in the same sequential order as the seed tuple

  • remove any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch). subsumed seeds will be included later when branches are collapsed.

  • for each combination of seed tokens, compute the shortest dependency graph path which contains all seed tuple words

  • reject dependency path lengths > threshold, as too verbose and unlikely to express the true original tuple’s meaning

  • walk the dependency path and generate very specific open pattern templates (lexical and POS constraints)

Parameters
  • seed_tuples (list) – list (or set) of seed_tuples from comp_sem_lib.generate_seed_tuples()

  • var_candidates (dict) – dict of seed tuple variable types, each containing a list of phrases that are var candidates, from comp_sem_lib.generate_seed_tuples()

  • corpus_sent_graphs (list) – list of nltk.parse.DependencyGraph objects for a corpus of sents

  • dict_seed_to_template_mappings (dict) – dict of mappings from seed_tuple type names (e.g. ‘ARGUMENT’) to open extraction template types (e.g. ‘arg’). template type must not contain a ‘_’ character.

  • dict_context_dep_types (dict) – dict of contextual dependency types that are to be added if not already on graph path (e.g. neg)

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • max_seed_variants (int) – max number of seed variants possible for an individual sent graph. seed variants are created by matching seed tokens to sent graph tokens, and exploding the combinations so all possibilities are checked. if a seed phrase contains tokens that appear many times in a sent, the combinations could get large. this setting provides an upper limit to ensure for these unusual cases processing time is not excessive.

  • allow_seed_subsumption (bool) – if True removes any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch)

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • space_replacement_char (str) – replacement char for all token spaces as dep graph cannot have a space. should be same as prepare_tags_for_dependency_parse()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of open template extraction strings ready for parsing using comp_sem_lib.parse_extraction_pattern()

Return type

list
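
a sketch of the template learning step, wiring seeds from generate_seed_tuples() into the graph walk; the ‘ARGUMENT’/‘RELATION’ to ‘arg’/‘rel’ mapping follows the examples above:

   import soton_corenlppy

   dict_mappings = { 'ARGUMENT': 'arg', 'RELATION': 'rel' }
   list_templates = soton_corenlppy.re.comp_sem_lib.generate_open_extraction_templates(
       seed_tuples=list_seed_tuples,
       var_candidates=var_candidates,
       corpus_sent_graphs=list_dep_graphs,
       dict_seed_to_template_mappings=dict_mappings,
       dict_openie_config=dict_openie_config )
   # each template string is then parsed with comp_sem_lib.parse_extraction_pattern()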

soton_corenlppy.re.comp_sem_lib.generate_open_extraction_templates_batch(seed_tuples=None, var_candidates=None, dict_document_sent_graphs={}, dict_seed_to_template_mappings={}, dict_context_dep_types=[], max_processes=4, longest_dep_path=32, longest_inter_target_walk=2, max_seed_variants=128, allow_seed_subsumption=True, avoid_dep_set={}, dict_openie_config=None)[source]

generate a set of open pattern templates based on a training corpus (dependency parsed into graphs) and seed_tuples with known ‘high quality’ argument and relation groups. use multiprocess spawning to maximize CPU usage as this is a slow, CPU-intensive process.

see comp_sem_lib.generate_open_extraction_templates() for details

Parameters
  • seed_tuples (list) – list (or set) of seed_tuples from comp_sem_lib.generate_seed_tuples()

  • var_candidates (dict) – dict of seed tuple variable types, each value containing a list of phrase tuples that are var candidates, from comp_sem_lib.generate_seed_tuples()

  • dict_document_sent_graphs (dict) – dict of document ID keys, each value being a list of nltk.parse.DependencyGraph objects for a corpus of sents

  • dict_seed_to_template_mappings (dict) – dict of mappings from seed_tuple type names (e.g. ‘ARGUMENT’) to open extraction template types (e.g. ‘arg’)

  • dict_context_dep_types (dict) – dict of contextual dependency types that are to be added if not already on graph path (e.g. neg)

  • max_processes (int) – number of worker processes to spawn using multiprocessing.Process

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • max_seed_variants (int) – max number of seed variants possible for an individual sent graph. seed variants are created by matching seed tokens to sent graph tokens, and exploding the combinations so all possibilities are checked. if a seed phrase contains tokens that appear many times in a sent, the combinations could get large. this setting provides an upper limit to ensure for these unusual cases processing time is not excessive.

  • allow_seed_subsumption (bool) – if True removes any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch)

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of open template extraction strings ready for parsing using comp_sem_lib.parse_extraction_pattern()

Return type

list

soton_corenlppy.re.comp_sem_lib.generate_open_extraction_templates_worker(seed_tuples=None, var_candidates=None, tuple_queue=None, dict_seed_to_template_mappings={}, dict_context_dep_types=[], longest_dep_path=32, longest_inter_target_walk=2, max_seed_variants=128, allow_seed_subsumption=True, avoid_dep_set={}, pause_on_start=0, process_id=0, dict_openie_config=None)[source]

worker thread for comp_sem_lib.generate_open_extraction_templates_batch()

Parameters
  • seed_tuples (list) – list (or set) of seed_tuples from comp_sem_lib.generate_seed_tuples()

  • var_candidates (dict) – dict of seed tuple variable types, each containing a list of phrases that are var candidates, from comp_sem_lib.generate_seed_tuples()

  • tuple_queue (tuple) – tuple of queues (queueIn, queueOut, queueError). queueIn has serialized nltk.parse.DependencyGraph objects. queueOut has a list of template patterns for each graph.

  • dict_seed_to_template_mappings (dict) – dict of mappings from seed_tuple type names (e.g. ‘ARGUMENT’) to open extraction template types (e.g. ‘arg’)

  • dict_context_dep_types (dict) – dict of contextual dependency types that are to be added if not already on graph path (e.g. neg)

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • max_seed_variants (int) – max number of seed variants possible for an individual sent graph. seed variants are created by matching seed tokens to sent graph tokens, and exploding the combinations so all possibilities are checked. if a seed phrase contains tokens that appear many times in a sent, the combinations could get large. this setting provides an upper limit to ensure for these unusual cases processing time is not excessive.

  • allow_seed_subsumption (bool) – if True removes any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch)

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • pause_on_start (int) – number of seconds to delay thread startup before CPU intensive work begins (to allow other workers to start up also)

  • process_id (int) – process ID for logging

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.generate_proposition_set_from_extraction(list_extracted_vars=None, dep_graph=None, proposition_pattern=['arg', 'rel', 'arg'], dict_displaced_context={'arg': [], 'rel': ['nsubj', 'nsubjpass', 'dobj', 'iobj', 'csubj', 'csubjpass', 'xcomp', 'nmod', 'nmod:*', 'advcl', 'advcl:*', 'neg', 'nfincl', 'nfincl:*', 'ncmod', 'ncmod:*', 'acl', 'acl:*', 'vocative', 'discourse', 'expl', 'aux', 'auxpass', 'cop', 'mark', 'punct', 'nummod', 'appos', 'nmod', 'nmod:*', 'relcl', 'nfincl', 'nfincl:*', 'ncmod', 'ncmod:*', 'amod', 'det', 'neg', 'compound', 'compound:*', 'name', 'mwe', 'foreign', 'goeswith', 'conj', 'cc', 'dep']}, max_semantic_dist=None, include_context=True, space_replacement_char='_', dict_sem_drift={'appos': 2, 'cc': 2, 'conj': 2, 'dep': 0, 'dislocated': 2, 'list': 2, 'parataxis': 2, 'punct': 2, 'remnant': 2}, dict_openie_config=None)[source]

generate a proposition set from a set of extracted variables from comp_sem_lib.match_extraction_patterns().

Parameters
  • list_extracted_vars (list) – list of variables from comp_sem_lib.match_extraction_patterns()

  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • proposition_pattern (list) – sequence of variable types to use to make a propositional expression

  • max_semantic_dist (int) – max allowed semantic distance between vars in a proposition (can be None)

  • include_context (bool) – include context variables within proposition phrases (these may be displaced from the target variables in proposition_pattern)

  • space_replacement_char (str) – replacement char for all token spaces. should be same as prepare_tags_for_dependency_parse()

  • dict_sem_drift (dict) – dict of semantic drift costs for dep types e.g. { ‘conj’ : 2 }

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of tuples = (prop_phrase,prop_head,prop_addr_list,head_addr_list,pattern_index,proposition_pattern) or None. prop_phrase = propositional expression from extraction as defined by requested pattern [arg_phrase, rel_phrase, arg_phrase]. prop_head = propositional expression’s head tokens [arg_head, rel_head, arg_head]. pattern_index = index of original extraction pattern that generated this proposition.

Return type

list
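
a sketch turning one matched extraction into (arg, rel, arg) propositions:

   import soton_corenlppy

   list_props = soton_corenlppy.re.comp_sem_lib.generate_proposition_set_from_extraction(
       list_extracted_vars=list_matches[0],
       dep_graph=dep_graph,
       proposition_pattern=['arg', 'rel', 'arg'],
       dict_openie_config=dict_openie_config )
   # each tuple = (prop_phrase, prop_head, prop_addr_list, head_addr_list,
   #               pattern_index, proposition_pattern)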

soton_corenlppy.re.comp_sem_lib.generate_seed_tuples(list_sent_trees=None, generation_strategy='contiguous_tuple', set_annotations=None, dict_annotation_phrase_patterns={}, list_sequences=None, prevent_sequential_instances=None, lower_case=False, stemmer=None, dict_openie_config=None)[source]

generate seed tuples and var candidates ready for generate_open_extraction_templates() using a number of strategies. a seed tuple is a sequence of annotations suitable for graph walk targets, e.g. [ (‘ARGUMENT’,’London’,’bridge’), (‘RELATION’,’is’, ‘burning’), (‘ARGUMENT’,’down’) ]. a var candidate is a sub-phrase (noun, verb or pronoun phrase) appearing in an annotation, e.g. { ‘ARGUMENT’ : [‘London’, …] }

generation strategies:
  • predefined_sequences - explicitly allowed sent annotation sequences e.g. (arg rel arg), (arg rel arg rel arg)

  • contiguous_tuple - all possible sent annotation tuple (up to quad) contiguous sequence combinations e.g. (arg, rel, arg)

  • contiguous_tuple_candidates - all possible var candidate tuple (up to quad) contiguous sequence combinations e.g. (arg, rel, arg)

  • contiguous_tuple_with_seq_groups - all possible sent annotation tuple (up to quad), which meet both a contiguous and sequence group criteria e.g. (arg, (rel,arg))

var candidate generation strategy:
  • allow any pronoun, noun and verb phrase appearing within an annotation

returns a set of seed tuples that a later dependency graph walk can use as target vars, and a set of phrases for possible intermediate var candidates, so context var types can be replaced with known var types. note that ,:; characters are removed from generated seed tuples as they do not appear in a dep graph (so a tuple with them would never match). note that START and END are special labels indicating the start or end of the sent should be matched

Parameters
  • list_sent_trees (list) – list of nltk.Tree representing the sents in that doc = [ nltk.Tree( ‘(S (DT the) (ARGUMENT (NN handle)) (RELATION (VBZ missing)) … (REF (NP Agora) (NP XXIII) (, ,) (DOC_SECTION pl. 44) (DOC_SECTION no. 448)) …)’ ), … ]

  • generation_strategy (str) – name of generation strategy = predefined_sequences|contiguous_tuple|contiguous_tuple_candidates|contiguous_tuple_with_seq_groups

  • set_annotations (set) – filter set of annotation labels to create sequences for e.g. set( [‘RELATION’,’ARGUMENT’] ) - [predefined_sequences, contiguous_triples]

  • dict_annotation_phrase_patterns (dict) – allowed phrase patterns for each variable type e.g. { ‘ARGUMENT’ : [‘noun_phrase’,’pronoun’], ‘RELATION’ : [‘verb_phrase’] }

  • list_sequences (list) – list of sequences to allow as seed_tuples, including special START and END labels. for predefined_sequences it’s the set of predefined sequences e.g. [ (‘ARGUMENT’,’RELATION’,’ARGUMENT’), (‘ARGUMENT’,’RELATION’) ]. for contiguous_tuple it’s the tuple pattern (up to quad) to generate e.g. [ ‘ARGUMENT’,’RELATION’,’ARGUMENT’ ] - [predefined_sequences, contiguous_triples]

  • prevent_sequential_instances (list) – list of seed types to prevent sequential instance matching e.g. [‘PREPOSITION’] - [predefined_sequences]

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on last phrase token (default is None)

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

tuple of (list_seed_tuples, var_candidates). list_seed_tuples = [ ( (‘ARGUMENT’,’John’), (‘RELATION’,’come’,’from’), (‘ARGUMENT’,’Paris’) ), … ]. var_candidates = { ‘ARGUMENT’ : [ (‘John’,’Barnes’), (‘Pele’,) ] }

Return type

tuple
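
a sketch generating contiguous (arg, rel, arg) seed tuples from annotated sent trees:

   import soton_corenlppy

   ( list_seed_tuples, var_candidates ) = soton_corenlppy.re.comp_sem_lib.generate_seed_tuples(
       list_sent_trees=list_sent_trees,
       generation_strategy='contiguous_tuple',
       set_annotations=set( ['ARGUMENT', 'RELATION'] ),
       list_sequences=[ 'ARGUMENT', 'RELATION', 'ARGUMENT' ],
       dict_openie_config=dict_openie_config )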

soton_corenlppy.re.comp_sem_lib.generate_seeds_and_templates_batch(dict_document_sent_trees={}, generation_strategy='contiguous_tuple', seed_filter_strategy='premissive', set_annotations=None, dict_annotation_phrase_patterns={}, list_sequences=None, prevent_sequential_instances=None, lower_case=False, stemmer=None, lex_phrase_index=None, lex_uri_index=None, dict_document_sent_graphs={}, dict_seed_to_template_mappings={}, dict_context_dep_types=[], max_processes=4, longest_dep_path=32, longest_inter_target_walk=2, max_seed_variants=128, allow_seed_subsumption=True, avoid_dep_set={}, dict_openie_config=None)[source]

aggregate function. this will call generate_seed_tuples() and then generate_open_extraction_templates_batch() per document rather than per corpus. this is a lot faster as the seed search space is constrained to document level not corpus level, but might reject some useful seeds not found in the original POS seed patterns.

Parameters
  • dict_document_sent_trees (dict) – dict of document ID keys, each value being a list of nltk.Tree representing the sents in that doc = [ nltk.Tree( ‘(S (DT the) (ARGUMENT (NN handle)) (RELATION (VBZ missing)) … (REF (NP Agora) (NP XXIII) (, ,) (DOC_SECTION pl. 44) (DOC_SECTION no. 448)) …)’ ), … ]

  • generation_strategy (str) – name of generation strategy = predefined_sequences|contiguous_tuple|contiguous_tuple_candidates|contiguous_tuple_with_seq_groups

  • set_annotations (set) – filter set of annotation labels to create sequences for e.g. set( [‘RELATION’,’ARGUMENT’] ) - [predefined_sequences, contiguous_triples]

  • dict_annotation_phrase_patterns (dict) – allowed phrase patterns for each variable type e.g. { ‘ARGUMENT’ : [‘noun_phrase’,’pronoun’], ‘RELATION’ : [‘verb_phrase’] }

  • list_sequences (list) – list of sequences to allow as seed_tuples, including special START and END labels. for predefined_sequences it’s the set of predefined sequences e.g. [ (‘ARGUMENT’,’RELATION’,’ARGUMENT’), (‘ARGUMENT’,’RELATION’) ]. for contiguous_tuple it’s the tuple pattern (up to quad) to generate e.g. [ ‘ARGUMENT’,’RELATION’,’ARGUMENT’ ] - [predefined_sequences, contiguous_triples]

  • prevent_sequential_instances (list) – list of seed types to prevent sequential instance matching e.g. [‘PREPOSITION’] - [predefined_sequences]

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on last phrase token (default is None)

  • lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • dict_document_sent_graphs (dict) – dict of document ID keys, each value being a list of nltk.parse.DependencyGraph objects for a corpus of sents

  • dict_seed_to_template_mappings (dict) – dict of mappings from seed_tuple type names (e.g. ‘ARGUMENT’) to open extraction template types (e.g. ‘arg’)

  • dict_context_dep_types (dict) – dict of contextual dependency types that are to be added if not already on graph path (e.g. neg)

  • max_processes (int) – number of worker processes to spawn using multiprocessing.Process

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • max_seed_variants (int) – max number of seed variants possible for an individual sent graph. seed variants are created by matching seed tokens to sent graph tokens, and exploding the combinations so all possibilities are checked. if a seed phrase contains tokens that appear many times in a sent, the combinations could get large. this setting provides an upper limit to ensure for these unusual cases processing time is not excessive.

  • allow_seed_subsumption (bool) – if True removes any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch)

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of open template extraction strings ready for parsing using comp_sem_lib.parse_extraction_pattern()

Return type

list

soton_corenlppy.re.comp_sem_lib.generate_seeds_and_templates_worker(tuple_queue=None, generation_strategy='contiguous_tuple', seed_filter_strategy='premissive', set_annotations=None, dict_annotation_phrase_patterns={}, list_sequences=None, prevent_sequential_instances=None, lower_case=False, stemmer=None, lex_phrase_index=None, lex_uri_index=None, dict_seed_to_template_mappings={}, dict_context_dep_types=[], longest_dep_path=32, longest_inter_target_walk=2, max_seed_variants=128, allow_seed_subsumption=True, avoid_dep_set={}, pause_on_start=0, process_id=0, dict_openie_config=None)[source]

worker thread for comp_sem_lib.generate_seeds_and_templates_batch()

Parameters
  • tuple_queue (tuple) – tuple of queues (queueIn, queueOut, queueError). queueIn has serialized nltk.parse.DependencyGraph objects. queueOut has a list of template patterns for each graph.

  • generation_strategy (str) – name of generation strategy = predefined_sequences|contiguous_tuple|contiguous_tuple_candidates|contiguous_tuple_with_seq_groups

  • set_annotations (set) – filter set of annotation labels to create sequences for e.g. set( [‘RELATION’,’ARGUMENT’] ) - [predefined_sequences, contiguous_triples]

  • dict_annotation_phrase_patterns (dict) – allowed phrase patterns for each variable type e.g. { ‘ARGUMENT’ : [‘noun_phrase’,’pronoun’], ‘RELATION’ : [‘verb_phrase’] }

  • list_sequences (list) – list of sequences to allow as seed_tuples, including special START and END labels. for predefined_sequences it’s the set of predefined sequences e.g. [ (‘ARGUMENT’,’RELATION’,’ARGUMENT’), (‘ARGUMENT’,’RELATION’) ]. for contiguous_tuple it’s the tuple pattern (up to quad) to generate e.g. [ ‘ARGUMENT’,’RELATION’,’ARGUMENT’ ] - [predefined_sequences, contiguous_triples]

  • prevent_sequential_instances (list) – list of seed types to prevent sequential instance matching e.g. [‘PREPOSITION’] - [predefined_sequences]

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on last phrase token (default is None)

  • lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • dict_seed_to_template_mappings (dict) – dict of mappings from seed_tuple type names (e.g. ‘ARGUMENT’) to open extraction template types (e.g. ‘arg’)

  • dict_context_dep_types (dict) – dict of contextual dependency types that are to be added if not already on graph path (e.g. neg)

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • max_seed_variants (int) – max number of seed variants possible for an individual sent graph. seed variants are created by matching seed tokens to sent graph tokens, and exploding the combinations so all possibilities are checked. if a seed phrase contains tokens that appear many times in a sent, the combinations could get large. this setting provides an upper limit to ensure for these unusual cases processing time is not excessive.

  • allow_seed_subsumption (bool) – if True removes any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch)

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • pause_on_start (int) – number of seconds to delay thread startup before CPU intensive work begins (to allow other workers to start up also)

  • process_id (int) – process ID for logging

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.generate_templates_from_predefined_seeds_batch(dict_document_sent_trees={}, dict_document_seed_tuples={}, dict_document_var_candidates={}, seed_filter_strategy='premissive', lower_case=False, stemmer=None, lex_phrase_index=None, lex_uri_index=None, dict_document_sent_graphs={}, dict_seed_to_template_mappings={}, dict_context_dep_types=[], max_processes=4, longest_dep_path=32, longest_inter_target_walk=2, max_seed_variants=128, allow_seed_subsumption=True, avoid_dep_set={}, dict_openie_config=None)[source]

aggregate function using pre-loaded seed tuples

Parameters
  • dict_document_sent_trees (dict) – dict of document ID keys, each value being a list of nltk.Tree representing the sents in that doc = [ nltk.Tree( ‘(S (DT the) (ARGUMENT (NN handle)) (RELATION (VBZ missing)) … (REF (NP Agora) (NP XXIII) (, ,) (DOC_SECTION pl. 44) (DOC_SECTION no. 448)) …)’ ), … ]

  • dict_document_seed_tuples (dict) – dict of seed tuples for each doc created from generate_seed_tuples()

  • dict_document_var_candidates (dict) – dict of var candidates for each doc from generate_seed_tuples()

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on last phrase token (default is None)

  • lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • dict_document_sent_graphs (dict) – dict of document ID keys, each value being a list of nltk.parse.DependencyGraph objects for a corpus of sents

  • dict_seed_to_template_mappings (dict) – dict of mappings from seed_tuple type names (e.g. ‘ARGUMENT’) to open extraction template types (e.g. ‘arg’)

  • dict_context_dep_types (dict) – dict of contextual dependency types that are to be added if not already on graph path (e.g. neg)

  • max_processes (int) – number of worker processes to spawn using multiprocessing.Process

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • max_seed_variants (int) – max number of seed variants possible for an individual sent graph. seed variants are created by matching seed tokens to sent graph tokens, and exploding the combinations so all possibilities are checked. if a seed phrase contains tokens that appear many times in a sent, the combinations could get large. this setting provides an upper limit to ensure for these unusual cases processing time is not excessive.

  • allow_seed_subsumption (bool) – if True removes any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch)

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of open template extraction strings ready for parsing using comp_sem_lib.parse_extraction_pattern()

Return type

list

soton_corenlppy.re.comp_sem_lib.generate_templates_from_predefined_seeds_worker(tuple_queue=None, seed_filter_strategy='premissive', lower_case=False, stemmer=None, lex_phrase_index=None, lex_uri_index=None, dict_seed_to_template_mappings={}, dict_context_dep_types=[], longest_dep_path=32, longest_inter_target_walk=2, max_seed_variants=128, allow_seed_subsumption=True, avoid_dep_set={}, pause_on_start=0, process_id=0, dict_openie_config=None)[source]

worker thread for comp_sem_lib.generate_templates_from_predefined_seeds_batch()

Parameters
  • tuple_queue (tuple) – tuple of queues (queueIn, queueOut, queueError). queueIn has serialized nltk.parse.DependencyGraph objects. queueOut has a list of template patterns for each graph.

  • lower_case (bool) – if True all lexicon tokens will be converted to lower case. otherwise case is left intact.

  • stemmer (nltk.stem.api.StemmerI) – stemmer to use on last phrase token (default is None)

  • lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()

  • dict_seed_to_template_mappings (dict) – dict of mappings from seed_tuple type names (e.g. ‘ARGUMENT’) to open extraction template types (e.g. ‘arg’)

  • dict_context_dep_types (dict) – dict of contextual dependency types that are to be added if not already on graph path (e.g. neg)

  • longest_dep_path (int) – longest graph path allowed for walks, to avoid very large walks with many combinations that simply take too long.

  • longest_inter_target_walk (int) – longest inter-target variable walk distance allowed. if there are too many dep graph steps the semantic drift will be too big and the resulting extraction probably meaningless

  • max_seed_variants (int) – max number of seed variants possible for an individual sent graph. seed variants are created by matching seed tokens to sent graph tokens, and exploding the combinations so all possibilities are checked. if a seed phrase contains tokens that appear many times in a sent, the combinations could get large. this setting provides an upper limit to ensure for these unusual cases processing time is not excessive.

  • allow_seed_subsumption (bool) – if True removes any seed token which is subsumed by another seed token (i.e. it’s under a root seed node on a dependency graph branch)

  • avoid_dep_set (set) – set of dep types to avoid walking (default empty set)

  • pause_on_start (int) – number of seconds to delay thread startup before CPU intensive work begins (to allow other workers to start up also)

  • process_id (int) – process ID for logging

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.get_dep_tree_addresses(dep_graph=None, branch_address=None, list_address_set=None, dict_openie_config=None)[source]

return a list of all address nodes under a branch node in a dep graph

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – dependency graph to process

  • branch_address (int) – address of branch to process

  • list_address_set (list) – result list of addresses (list will be populated by function)

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.get_dependency_parser(dict_openie_config=None, dep_options=['-tokenized', '-tagSeparator', '/', '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer', '-tokenizerMethod', 'newCoreLabelTokenizerFactory', '-maxLength', '200'])[source]

return a java command line for popen to run the Stanford dependency parser using exec_dep_parser(). the command omits an input text filename, so tagged text is assumed to be provided via STDIN. default options limit the parser to 200 words to avoid reported dep parser hangs when processing overly long sentences.

Parameters
  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

  • dep_options (list) – list of options for stanford dep parser

Returns

java command line for popen()

Return type

list

soton_corenlppy.re.comp_sem_lib.get_extraction_vars(list_extracted_vars=None, dict_openie_config=None)[source]

return a list of extraction variable names and types from an extraction

Parameters
  • list_extracted_vars (list) – list of variables from comp_sem_lib.match_extraction_patterns()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of tuple = ( variable_name, variable_type )

Return type

list

soton_corenlppy.re.comp_sem_lib.get_variables_connections(check_index=None, list_variables=None, dep_graph=None, str_connection_path='', recurse_address=None, dict_openie_config=None)[source]

get dependency graph connections from this variable (any addresses in collapsed set) to any other variables (any addresses in collapsed set)

Parameters
  • check_index (int) – index of variable to check

  • list_variables (list) – list of variables in extraction to try to connect

  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • str_connection_path (str) – internal argument, connection path walked so far

  • recurse_address (int) – internal argument, graph address to recurse from

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

{ ‘dep>nsubj’ : set([‘arg1’]), … }

Return type

dict

soton_corenlppy.re.comp_sem_lib.index_cross_variable_connections(list_variables=None, dict_sem_drift={'appos': 2, 'cc': 2, 'conj': 2, 'dep': 0, 'dislocated': 2, 'list': 2, 'parataxis': 2, 'punct': 2, 'remnant': 2}, dict_openie_config=None)[source]

compute an index of how each variable connects to other variables within a specific extraction. direct connections between variables are indexed first, where one variable can be walked via the dep graph to another directly. indirect N-deep connections between variables are also indexed, where one variable can be walked via the dep graph to another directly OR via an intermediate variable. variable bases are used to map connections, not individual variables, so multi-token variables will not be connected multiple times i.e. ‘arg1’ not ‘arg1_2’

the number of dep graph walk steps between variables is recorded as a kind of proxy for semantic drift along the graph walk between extracted variables. each coordinating conjunction, loose joining relation and appositional modifier adds an extra 2 to the step count.

Index types:
  • index_direct_connect - base variable inter-connections

  • index_any_connect - compute for each variable the other variables it has a connection to (including via context and other intermediate variables)

  • index_context_connect - compute for each variable the other variables it has a connection to ONLY following context links

Parameters
  • list_variables (list) – list of variables in extraction

  • dict_sem_drift (dict) – dict of semantic drift costs for dep types e.g. { ‘conj’ : 2 }

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

tuple = ( index_direct_connect, index_any_connect, index_context_connect ). index_direct_connect = { var_base : set([ ( direct_var_connection, walk_steps ), … ]) }. index_any_connect = { var_base : set([ ( any_var_connection, walk_steps ) ]) }

Return type

tuple

soton_corenlppy.re.comp_sem_lib.kill_dep_parser()[source]

SIGTERM handler for exec_dep_parser() to ensure the Stanford dep parser process is terminated. otherwise it will wait forever for more text to arrive via STDIN
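
a minimal sketch of registering the handler (since the signature above takes no arguments, a wrapper lambda is assumed to be needed to match the (signum, frame) handler signature):

    import signal
    import soton_corenlppy.re.comp_sem_lib as comp_sem_lib

    # ensure the dep parser subprocess is terminated when this process gets SIGTERM
    signal.signal( signal.SIGTERM, lambda signum, frame : comp_sem_lib.kill_dep_parser() )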

soton_corenlppy.re.comp_sem_lib.match_extraction_patterns(dep_graph=None, list_extraction_patterns=[], dict_collapse_dep_types={}, dict_assert_true={'en': ['true', 'genuine', 'real', 'confirmed', 'verified']}, dict_assert_false={'en': ['false', 'fake', 'hoax', 'joke', 'trick', 'debunked']}, dict_openie_config=None)[source]

apply a set of open pattern templates, parsed using comp_sem_lib.parse_extraction_pattern(), to a dependency graph. the patterns are executed in the order they appear in the list, and all possible matches are returned (a pattern could be materialized in several ways if, for example, there are multiple dep options). the result of a match is a set of matched variables from the open pattern template, including for each a graph_address (i.e. token position) and collapsed phrase (i.e. dependency graph collapsed to produce a text phrase for the variable).

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • list_extraction_patterns (list) – list of parsed open pattern templates from comp_sem_lib.parse_extraction_pattern()

  • dict_collapse_dep_types (dict) – dependency graph types to use when collapsing variable branches = { ‘var_type’ : set([dep_type, dep_type, …]), … }

  • dict_assert_true (dict) – dict of language specific vocabulary for tokens asserting truth used for handling negation. use {} to avoid using a negation vocabulary.

  • dict_assert_false (dict) – dict of language specific vocabulary for tokens asserting falsehood used for handling negation. use {} to avoid using a negation vocabulary.

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of matches, each with a list of variables from the matched pattern, or [] if no match occurred = [ [ ( var_type, var_name, graph_address, collapsed_graph_addresses[], dictConnection{}, pattern_index ), … ], … ]. dictConnection is obtained from get_variables_connections()

Return type

list
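
a minimal sketch of matching a single dependency graph (assuming dep_graph comes from parse_sent_trees(), and using the first pattern example documented under parse_extraction_pattern() below):

    import soton_corenlppy.re.comp_sem_lib as comp_sem_lib

    # parse a serialized open pattern template into an executable form
    list_patterns = [
        comp_sem_lib.parse_extraction_pattern(
            str_pattern = u'{arg1_1:} <nsubj< {rel1_1:pos=VBD} >dobj> {arg2_1:}',
            dict_openie_config = dict_openie_config )
        ]

    # apply the templates; each match is a list of ( var_type, var_name, ... ) tuples
    list_matches = comp_sem_lib.match_extraction_patterns(
        dep_graph = dep_graph,
        list_extraction_patterns = list_patterns,
        dict_openie_config = dict_openie_config )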

soton_corenlppy.re.comp_sem_lib.match_extraction_patterns_batch(dict_document_sent_graphs={}, list_extraction_patterns=[], dict_collapse_dep_types={}, dict_assert_true={'en': ['true', 'genuine', 'real', 'confirmed', 'verified']}, dict_assert_false={'en': ['false', 'fake', 'hoax', 'joke', 'trick', 'debunked']}, max_processes=4, dict_openie_config=None)[source]

apply a set of open pattern templates, parsed using comp_sem_lib.parse_extraction_pattern(), to a batch of dependency graphs. multiprocess spawning is used to maximize CPU usage, as pattern matching is slow and CPU intensive.

the patterns are executed in the order they appear in the list. the result of a match is a set of matched variables from the open pattern template, including for each a graph_address (i.e. token position) and collapsed phrase (i.e. dependency graph collapsed to produce a text phrase for the variable).

Parameters
  • dict_document_sent_graphs (dict) – dict of document ID keys, each value being a list of nltk.parse.DependencyGraph objects representing a sent

  • list_extraction_patterns (list) – list of parsed open pattern templates from comp_sem_lib.parse_extraction_pattern()

  • dict_collapse_dep_types (dict) – dependency graph types to use when collapsing variable branches = { ‘var_type’ : set([dep_type, dep_type, …]), … }

  • dict_assert_true (dict) – dict of language specific vocabulary for tokens asserting truth used for handling negation. use {} to avoid using a negation vocabulary.

  • dict_assert_false (dict) – dict of language specific vocabulary for tokens asserting falsehood used for handling negation. use {} to avoid using a negation vocabulary.

  • max_processes (int) – number of worker processes to spawn using multiprocessing.Process

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

dict of sent matches = { documentID : [ [ [ ( var_type, var_name, graph_address, collapsed_graph_addresses[], { dep : [ var_name, … ] } ), … x Nvar ], … x Nextract ] … x Nsent ] }

Return type

dict
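
a minimal batch sketch (assuming dict_doc_graphs maps document IDs to lists of nltk.parse.DependencyGraph objects, e.g. from parse_sent_trees_batch(), and list_patterns is as in the previous example):

    # spawn 4 worker processes to match patterns across all documents
    dict_matches = comp_sem_lib.match_extraction_patterns_batch(
        dict_document_sent_graphs = dict_doc_graphs,
        list_extraction_patterns = list_patterns,
        max_processes = 4,
        dict_openie_config = dict_openie_config )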

soton_corenlppy.re.comp_sem_lib.match_extraction_patterns_batch_worker(tuple_queue=None, list_extraction_patterns=[], dict_collapse_dep_types={}, dict_assert_true={'en': ['true', 'genuine', 'real', 'confirmed', 'verified']}, dict_assert_false={'en': ['false', 'fake', 'hoax', 'joke', 'trick', 'debunked']}, pause_on_start=0, process_id=0, dict_openie_config=None)[source]

worker process (spawned via multiprocessing.Process) for comp_sem_lib.match_extraction_patterns_batch()

Parameters
  • tuple_queue (tuple) – tuple of queue (queueIn, queueOut, queueError). queueIn has tuples of ( doc_id, sent_index, serialized nltk.parse.DependencyGraph object ). queueOut has tuples of ( doc_id, sent_index, list_extractions ).

  • list_extraction_patterns (list) – list of parsed open pattern templates from comp_sem_lib.parse_extraction_pattern()

  • dict_collapse_dep_types (dict) – dependency graph types to use when collapsing variable branches = { ‘var_type’ : set([dep_type, dep_type, …]), … }

  • dict_assert_true (dict) – dict of language specific vocabulary for tokens asserting truth used for handling negation. use {} to avoid using a negation vocabulary.

  • dict_assert_false (dict) – dict of language specific vocabulary for tokens asserting falsehood used for handling negation. use {} to avoid using a negation vocabulary.

  • pause_on_start (int) – number of seconds to delay thread startup before CPU intensive work begins (to allow other workers to startup also)

  • process_id (int) – ID of process for logging purposes

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.normalize_open_extraction_templates(list_patterns=None, topN=1000, lang='eng', dict_generalize_strategy={'relax_lex': ['arg', 'rel'], 'relax_pos': [], 'relax_pos_number_aware': ['ctxt']}, dict_openie_config=None)[source]

aggregate and normalize a set of open pattern templates generated by comp_sem_lib.generate_open_extraction_templates()

aggregate and normalize open pattern templates:
  • merge lexical and pos constraints for patterns with the same structure

  • remove all pos and lexical constraints on args based on dict_generalize_strategy

  • remove all lexical constraints on relations

  • TODO use lexicon (e.g. WordNet) to include semantic generalizations (i.e. hypernym) for known lexical constraints

Parameters
  • list_patterns (list) – list of open template extraction strings ready for parsing using comp_sem_lib.parse_extraction_pattern()

  • topN (int) – top N templates to return (-1 for all)

  • lang (str) – WordNet language

  • dict_generalize_strategy (dict) – { ‘relax_lex’ : list_of_var_types, ‘relax_pos’ : list_of_var_types }

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of open template extraction strings ready for parsing using comp_sem_lib.parse_extraction_pattern()

Return type

list

soton_corenlppy.re.comp_sem_lib.parse_allowed_dep_set(allowed_dep_set=None, dict_openie_config=None)[source]

internal function used by generate_open_extraction_templates() and match_extraction_patterns() via collapse_graph_to_make_phrase()

Parameters
  • allowed_dep_set (set) – set of allowed dependency types to include in the collapsed phrase. supports lexical constraints (e.g. ‘case:of’) for finer grained filtering. None to allow any dep.

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of tuple = [ ( dep, wild_card, [ conditionals, … ] ), … ]

Return type

list

soton_corenlppy.re.comp_sem_lib.parse_encoded_extraction(encoded_str=None, dict_openie_config=None)[source]

parse an encoded extraction produced by comp_sem_lib.encode_extraction(). variables are ordered by address

Parameters
  • encoded_str (unicode) – encoded extraction from comp_sem_lib.pretty_print_extraction( style=’encoded’ )

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of extracted variables = [ ( var_name, var_head, var_phrase, { dep_path : [ var,var… ], … }, address, pattern_index, var_phrase_human ), … ]

Return type

list

soton_corenlppy.re.comp_sem_lib.parse_extraction_pattern(str_pattern=None, dict_openie_config=None)[source]

parse an open pattern template serialized as a unicode string into a format easier to process by the function comp_sem_lib.match_extraction_patterns().

the open pattern template represents a directed walk of a dependency parsed sentence graph. the pattern elements are matched left to right. arguments represent noun phrases. relations represent verbs. slots represent context to the relation such as adverbs.

pattern elements:
  • {varN_1:POS_MARK:pos=TAG|TAG|…;lex=TOKEN|TOKEN|…} = variable node (suffix ‘_<seed_token_id>’) that has a POS tag in the set defined (any if set not defined), and token in the set defined (any if set not defined). POS_MARK = S (start), E (end) or - (somewhere in middle of sent). var = {arg, rel, context, prep …}

  • <dep_label< = instruction to permanently move up dependency tree via a specific dependency label (abort if dep_label not found)

  • >dep_label> = instruction to permanently move down dependency tree via a specific dependency label (abort if dep_label not found)

  • -dep_label- = instruction to move to siblings, bind next variable then return to the original position in the graph (abort if dep_label not found)

  • +dep_label+ = instruction to move to children, bind next variable then return to the original position in the graph (abort if dep_label not found)

pattern examples:
  • {arg1_1:} <nsubj< {rel1_1:pos=VBD} >dobj> {arg2_1:}

  • {rel1_1:pos=VBN;lex=announce|choose} <amod< {context0_1:pos=JJ} +case+ {arg1_1:} -appos- {arg2_1:}

Parameters
  • str_pattern (unicode) – serialized open pattern template

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of tuples, each a pattern instruction e.g. [ ( ‘rel’, ‘rel1_1’, ‘-’, ( (‘vbz,vbn’),(‘contains’,’holds’) ) ), ( ‘dep_child’, ‘aux’ ), … ]

Return type

list
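
a minimal round-trip sketch using the first pattern example above (serialize_extraction_pattern() is documented later in this module):

    import soton_corenlppy.re.comp_sem_lib as comp_sem_lib

    # parse a serialized open pattern template into a list of instruction tuples
    list_pattern = comp_sem_lib.parse_extraction_pattern(
        str_pattern = u'{arg1_1:} <nsubj< {rel1_1:pos=VBD} >dobj> {arg2_1:}',
        dict_openie_config = dict_openie_config )

    # serialize the parsed pattern back to a unicode string
    str_round_trip = comp_sem_lib.serialize_extraction_pattern(
        list_pattern = list_pattern,
        dict_openie_config = dict_openie_config )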

soton_corenlppy.re.comp_sem_lib.parse_sent_trees(list_sent_trees=None, dep_parser=None, dict_custom_pos_mappings={}, space_replacement_char='_', dict_openie_config=None)[source]

parse a list of sentence trees from soton_corenlppy.common_parse_lib.create_sent_trees() and return a list of dep graph objects

Parameters
  • list_sent_trees (list) – list of stanford POS tagged sent trees

  • dep_parser (list) – list of commands for popen() from get_dependency_parser()

  • dict_custom_pos_mappings (dict) – dict of custom POS mappings e.g. { ‘FIGURE’ : ‘CD’, ‘TABLE’ : ‘CD’, … }

  • space_replacement_char (str) – replacement char for all token spaces

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

list of nltk.parse.DependencyGraph

Return type

list
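
a minimal sketch (assuming list_sent_trees comes from soton_corenlppy.common_parse_lib.create_sent_trees() and dict_openie_config was created earlier):

    import soton_corenlppy.re.comp_sem_lib as comp_sem_lib

    # build the Stanford dep parser command line (tagged text supplied via STDIN)
    list_parser_cmd = comp_sem_lib.get_dependency_parser(
        dict_openie_config = dict_openie_config )

    # dependency parse the POS tagged sent trees into nltk.parse.DependencyGraph objects
    list_dep_graphs = comp_sem_lib.parse_sent_trees(
        list_sent_trees = list_sent_trees,
        dep_parser = list_parser_cmd,
        dict_openie_config = dict_openie_config )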

soton_corenlppy.re.comp_sem_lib.parse_sent_trees_batch(dict_doc_sent_trees=None, dep_parser=None, dict_custom_pos_mappings={}, space_replacement_char='_', max_processes=4, dict_openie_config=None)[source]

dependency parse a batch of documents, each with a list of Stanford POS tagged sents from soton_corenlppy.common_parse_lib.create_sent_trees(). multiprocess spawning is used to maximize CPU usage, as dependency parsing is slow and CPU intensive.

Parameters
  • dict_doc_sent_trees (dict) – dict of documents { docID : list of sent trees }

  • dep_parser (list) – list of commands for popen() from get_dependency_parser()

  • dict_custom_pos_mappings (dict) – dict of custom POS mappings e.g. { ‘FIGURE’ : ‘CD’, ‘TABLE’ : ‘CD’, … }

  • space_replacement_char (str) – replacement char for all token spaces

  • max_processes (int) – number of worker processes to spawn using multiprocessing.Process

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

dict of documents { docID : list of nltk.parse.DependencyGraph }

Return type

dict
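
a minimal batch sketch (assuming dict_doc_trees maps document IDs to lists of POS tagged sent trees, and list_parser_cmd is as in the previous example):

    # dependency parse all documents in parallel using 4 worker processes
    dict_doc_graphs = comp_sem_lib.parse_sent_trees_batch(
        dict_doc_sent_trees = dict_doc_trees,
        dep_parser = list_parser_cmd,
        max_processes = 4,
        dict_openie_config = dict_openie_config )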

soton_corenlppy.re.comp_sem_lib.parse_sent_trees_worker(tuple_queue=None, dep_parser=None, dict_custom_pos_mappings={}, space_replacement_char='_', pause_on_start=0, process_id=0, dict_openie_config=None)[source]

worker process (spawned via multiprocessing.Process) for comp_sem_lib.parse_sent_trees_batch()

Parameters
  • tuple_queue (tuple) – tuple of queue (queueIn, queueOut, queueError). the queueIn has ( docID, list_sent_trees ). queueOut has ( docID, list_serialized_graphs )

  • dep_parser (list) – list of commands for popen() from get_dependency_parser()

  • dict_custom_pos_mappings (dict) – dict of custom POS mappings e.g. { ‘FIGURE’ : ‘CD’, ‘TABLE’ : ‘CD’, … }

  • space_replacement_char (str) – replacement char for all token spaces

  • pause_on_start (int) – number of seconds to pause to allow other workers to start

  • process_id (int) – process ID

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.pattern_match_recurse_into_graph(dep_graph=None, graph_address=0, force_next_address=None, pattern_spec=[], pattern_pos=0, pattern_result=None, pattern_success=None, dict_openie_config=None)[source]

internal function called by comp_sem_lib.match_extraction_patterns(). nodes in the dependency graph are tried using a breadth first search strategy. for each search node the open pattern template is applied, recursing until either failure or success (end of template). the reference parameter pattern_result contains the variables matched at whatever level of recursion the algorithm has reached. memory note - a copy of the variables found is kept at each search node, so if the breadth first search has a lot of combinations the memory footprint will get large.

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • graph_address (int) – address is the token index in the dependency graph

  • force_next_address (int) – address to force any recursive call (allows a temp node to be explored and then switch back to a previous position in graph)

  • pattern_spec (list) – parsed open pattern template from comp_sem_lib.parse_extraction_pattern()

  • pattern_pos (int) – current position in parsed open pattern template

  • pattern_result (list) – current partially populated patterns = [(var_type, var_name, graph_address), … ]

  • pattern_success (list) – successful fully populated patterns = [(var_type, var_name, graph_address), … ]

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

soton_corenlppy.re.comp_sem_lib.prepare_tags_for_dependency_parse(list_tagged_sents=None, dict_custom_pos_mappings={}, space_replacement_char='_', dict_openie_config=None)[source]

prepare a list of tagged sents for dependency parsing. the stanford dependency parser gets confused if there are spaces in tokens (e.g. phrases) and does not understand custom POS tags. the following processing is applied to list_tagged_sents:

  • replace all token spaces with a replacement char

  • replace all custom POS tags (e.g. CITATION) with a replacement the tagger understands (e.g. PENN tags)

Parameters
  • list_tagged_sents (list) – reference argument providing a list of tagged sents that will be modified directly i.e. [ [ (token, pos), (token, pos), … ], … ]

  • dict_custom_pos_mappings (dict) – dict of custom POS mappings e.g. { ‘FIGURE’ : ‘CD’, ‘TABLE’ : ‘CD’, … }

  • space_replacement_char (str) – replacement char for all token spaces

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()
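
a minimal sketch (note that list_tagged_sents is modified in place; the tokens and mapping values are illustrative):

    import soton_corenlppy.re.comp_sem_lib as comp_sem_lib

    # tagged sents containing a multi-token phrase and a custom POS tag
    list_tagged_sents = [ [ ( u'New York', u'NNP' ), ( u'fig. 1', u'FIGURE' ) ] ]

    # replaces token spaces with '_' and maps FIGURE -> CD, directly in the list
    comp_sem_lib.prepare_tags_for_dependency_parse(
        list_tagged_sents = list_tagged_sents,
        dict_custom_pos_mappings = { u'FIGURE' : u'CD' },
        space_replacement_char = '_',
        dict_openie_config = dict_openie_config )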

soton_corenlppy.re.comp_sem_lib.pretty_print_extraction(list_extracted_vars=None, dep_graph=None, set_var_types={}, style='highlighted_vars', space_replacement_char='_', dict_openie_config=None)[source]

pretty print a set of extracted variables from comp_sem_lib.match_extraction_patterns(). arguments are sorted in lexical order for easier reading.

Parameters
  • list_extracted_vars (list) – list of variables from comp_sem_lib.match_extraction_patterns()

  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • set_var_types (set) – set of {var} types for pretty print

  • style (str) – style for print = highlighted_vars, plain_vars, tokens_only

  • space_replacement_char (str) – replacement char for all token spaces. should be same as prepare_tags_for_dependency_parse()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

pretty print version of arguments

Return type

unicode
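
a minimal sketch of pretty printing each match from match_extraction_patterns() (the set_var_types values are illustrative):

    # print each extraction with its matched variables highlighted
    for list_vars in list_matches :
        str_pretty = comp_sem_lib.pretty_print_extraction(
            list_extracted_vars = list_vars,
            dep_graph = dep_graph,
            set_var_types = { 'arg', 'rel' },
            style = 'highlighted_vars',
            dict_openie_config = dict_openie_config )
        print( str_pretty )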

soton_corenlppy.re.comp_sem_lib.pretty_print_extraction_var(list_extracted_vars=None, dep_graph=None, var_name=None, dict_pretty_dep_rels={'after_head': ['case', 'case:of', 'case:by'], 'any': ['compound', 'amod', 'nummod', 'advmod', 'cop', 'appos', 'dep', 'conj', 'nmod', 'xcomp'], 'before_head': []}, space_replacement_char='_', dict_openie_config=None)[source]

pretty print a specific variable in a specific extraction from comp_sem_lib.match_extraction_patterns(). pretty printed text appears in lexical order for easier reading.

Parameters
  • list_extracted_vars (list) – list of variables from comp_sem_lib.match_extraction_patterns()

  • dep_graph (nltk.parse.DependencyGraph) – dependency graph parsed using a dependency parser such as nltk.parse.stanford.StanfordDependencyParser()

  • var_name (str) – name of variable

  • dict_pretty_dep_rels (dict) – dict of dep rels to allow in pretty print based on address position relative to head { ‘any’ : [], ‘before_head’ : [],’after_head’ : [] }. None allows any dep

  • space_replacement_char (str) – replacement char for all token spaces. should be same as prepare_tags_for_dependency_parse()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

tuple = ( pretty_print_variable, head_token, (negated, genuine), { dep_path : [ var,var… ], … } )

Return type

tuple

soton_corenlppy.re.comp_sem_lib.read_pipe_stderr(pipe_handle, queue_buffer)[source]

internal DEP PARSE process pipe callback function

Parameters
  • pipe_handle (file) – pipe handle to DEP PARSE errors

  • queue_buffer (Queue.Queue()) – queue where pipe errors can be stored

soton_corenlppy.re.comp_sem_lib.read_pipe_stdout(pipe_handle, queue_buffer)[source]

internal DEP PARSE process pipe callback function

Parameters
  • pipe_handle (file) – pipe handle to DEP PARSE output

  • queue_buffer (Queue.Queue()) – queue where pipe output can be stored

soton_corenlppy.re.comp_sem_lib.serialize_b4_dep_graph_as_conll2007(root_dep_node=None, root_tokens_node=None, dict_openie_config=None)[source]

take stanford XML parsed bs4 soup nodes and return a serialized CoNLL 2007 formatted dependency graph

Parameters
  • root_dep_node (bs4.element.Tag) – bs4 root node for basic-dependencies list of dep types

  • root_tokens_node (bs4.element.Tag) – bs4 root node for token list

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

serialized CoNLL 2007 formatted dep graph, with row columns = {sent_index} {word} {lemma} {ctag} {tag} _ {head} {rel} _ _

Return type

unicode

soton_corenlppy.re.comp_sem_lib.serialize_dependency_graph(dep_graph, dict_openie_config)[source]

safely serialize an NLTK dependency graph using to_conll( style=4 ), including the special head ‘_’ token for blank nodes.

the dep_parser.tagged_parse_sents() function will return a Stanford Parser dep graph that has { ‘address’ : None } values for nodes such as punctuation (e.g. ‘,’). this causes errors when re-parsing a serialized graph, as the NLTK parser does not allow a CoNLL head value of ‘None’; instead it needs the special ‘_’ value which the Stanford Parser produces.

Parameters
  • dep_graph (nltk.parse.DependencyGraph) – dependency graph to serialize

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

serialized graph that can be read back in using nltk.parse.DependencyGraph( tree_str = str_serialized_graph, top_relation_label = ‘root’ )

Return type

unicode
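
a minimal round-trip sketch, using the read-back call documented above:

    import nltk.parse
    import soton_corenlppy.re.comp_sem_lib as comp_sem_lib

    # serialize safely, so blank node heads become '_' rather than None
    str_serialized_graph = comp_sem_lib.serialize_dependency_graph( dep_graph, dict_openie_config )

    # read the serialized graph back in
    dep_graph_copy = nltk.parse.DependencyGraph(
        tree_str = str_serialized_graph,
        top_relation_label = 'root' )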

soton_corenlppy.re.comp_sem_lib.serialize_extraction_pattern(list_pattern=None, dict_openie_config=None)[source]

serialize an open pattern template. see comp_sem_lib.parse_extraction_pattern()

Parameters
  • list_pattern (list) – parsed pattern from comp_sem_lib.parse_extraction_pattern()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

serialized pattern

Return type

unicode

soton_corenlppy.re.comp_sem_lib.unescape_extraction_pattern(token=None)[source]

unescape extraction pattern tokens. see comp_sem_lib.escape_extraction_pattern()

Parameters

token (unicode) – token to unescape

Returns

unescaped token

Return type

unicode