soton_corenlppy.re.sem_map_lib module
Semantic mapping for openie library
-
soton_corenlppy.re.sem_map_lib.
apply_semantic_mappings_to_extractions
(mapped_extracted_vars=None, list_parsed_semantic_patterns=None, this_uri=None, stemmer=None, binding_strategy='best_one', namespace_unknown_vocab='unknown_vocab', dict_openie_config=None) apply a list of parsed semantic patterns to a set of mapped extraction variables (originating from a sentence) to generate a set of resulting RDF productions in the form (subj pred obj). each extraction is checked against the mapping patterns. the extracted variables are bound to mapping pattern variables, and if all variables are successfully bound the associated production is generated. where multiple binding options share the top confidence value, either a single production is generated (best_one strategy) or all possible productions are generated (best_all strategy).
- Parameters
mapped_extracted_vars (list) – list of extraction variables mapped using sem_map_lib.map_extractions_to_lexicon()
list_parsed_semantic_patterns (list) – list of parsed semantic mapping patterns from sem_map_lib.import_semantic_mapping_patterns()
this_uri (str) – optional uri value for the ‘this’ entry in a semantic pattern production (default is None). this allows productions to be generated that reference an implied subject uri, for example text describing a physical object where the object itself is not explicitly mentioned but implied.
stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)
binding_strategy (str) – strategy for choosing variable bindings where multiple options exist = best_one|best_all. best_one will generate a production with the first occurring binding that has the highest confidence value (i.e. a single production is generated). best_all will generate a production for every binding that has the highest confidence value (i.e. many productions are generated).
namespace_unknown_vocab (str) – namespace to use for production URIs for text that matches the semantic mapping pattern conditions, but for which there are no lexicon matches. a None value will disable use of such non-lexicon text in output productions.
dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()
- Returns
set of RDF production triples = [ (subj pred obj), … ]
- Return type
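The difference between the two binding strategies can be illustrated with a small standalone sketch. This is not the library's implementation: the function name choose_bindings and the candidate structure (a dict from variable name to a list of (uri, confidence) pairs) are assumptions made purely for illustration.

```python
from itertools import product


def choose_bindings(candidates, binding_strategy="best_one"):
    """Hypothetical helper: pick bindings per variable by top confidence.

    candidates: dict of var_name -> list of (uri, confidence) pairs.
    Returns a list of dicts, one per production that would be generated.
    """
    per_var = []
    for var_name, options in candidates.items():
        if not options:
            # an unbound variable means no production can be generated
            return []
        top = max(conf for _, conf in options)
        best = [uri for uri, conf in options if conf == top]
        if binding_strategy == "best_one":
            # keep only the first occurring top-confidence binding
            best = best[:1]
        elif binding_strategy != "best_all":
            raise ValueError("unknown binding strategy: " + binding_strategy)
        per_var.append([(var_name, uri) for uri in best])
    # every combination of the surviving bindings yields one production
    return [dict(combo) for combo in product(*per_var)]


candidates = {
    "arg1": [("ex:vase", 1.0), ("ex:urn", 1.0), ("ex:pot", 0.5)],
    "arg2": [("ex:clay", 0.8)],
}
print(choose_bindings(candidates, "best_one"))
print(choose_bindings(candidates, "best_all"))
```

With best_one the tie between the two top-confidence options for arg1 is broken by order of occurrence, so one binding set results; with best_all both tied options are kept, so two binding sets result.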
-
soton_corenlppy.re.sem_map_lib.
generate_uri_for_literals
(literal_value=None, namespace=None, dict_openie_config=None) generate a safe RDF TTL entry for literal values which do not have any lexicon SKOS URI
- Parameters
literal_value (unicode) – literal text
namespace (unicode) – namespace for RDF node
dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()
- Returns
TTL formatted RDF node
- Return type
unicode
-
soton_corenlppy.re.sem_map_lib.
import_semantic_mapping_patterns
(filename_patterns=None, dict_openie_config=None) import from disk a set of serialized semantic mapping patterns (newline delimited). see sem_map_lib.parse_semantic_mapping_pattern() for pattern format.
-
soton_corenlppy.re.sem_map_lib.
map_encoded_extraction_to_lexicon
(extraction_vars=None, lex_phrase_index=None, lex_uri_index=None, only_best_matches=False, stemmer=None, max_gram=5, dict_openie_config=None) for an extraction, map variable phrases to lexicon phrases. this semantically grounds phrases extracted from the sentence in lexicon URIs. a phrase gram size is associated with each variable's semantic mapping. all potential mappings are returned, but the matches with the highest gram size are most likely to be good mappings. the confidence score is based on the percentage of tokens in an extracted variable that match the lexicon phrase. variables without a semantic mapping are removed from the final extraction list. the returned var_phrase and matched_phrase entries are lowercased and have had their tokens stemmed.
- Parameters
extraction_vars (list) – extraction var produced from soton_corenlppy.re.comp_sem_lib.parse_encoded_extraction()
lex_phrase_index (dict) – lexicon phrase index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()
lex_uri_index (dict) – lexicon uri index from soton_corenlppy.lexico.lexicon_lib.import_lexicon()
only_best_matches (bool) – only return variable matches with the highest confidence score for that variable
stemmer (nltk.stem.api.StemmerI) – stemmer to use on phrases (default is None)
max_gram (int) – maximum phrase gram size to check for matches in lexicon. larger gram sizes mean more lexicon checks, which is slower.
dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()
- Returns
semantically mapped sent extractions = [ [ (var_name, var_phrase, var_gram_size, [lexicon_uri, schema_uri, matched_phrase, phrase_gram_size, confidence_score] ), … ]
- Return type
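The gram-size matching and token-percentage confidence described above can be sketched as follows. This is a simplified illustration and not the library's code: the function name match_phrase_to_lexicon is hypothetical, the lexicon is assumed to be a plain dict from lowercase phrase to a list of URIs (the real index from import_lexicon() may be structured differently), and stemming is omitted.

```python
def match_phrase_to_lexicon(var_phrase, lex_phrase_index, max_gram=5,
                            only_best_matches=False):
    """Hypothetical helper: find lexicon phrases inside a variable phrase.

    Returns a list of (uri, matched_phrase, gram_size, confidence) tuples,
    where confidence = matched tokens / total tokens in the variable phrase.
    """
    tokens = var_phrase.lower().split()
    matches = []
    # try the largest gram sizes first, down to single tokens
    for gram_size in range(min(max_gram, len(tokens)), 0, -1):
        for start in range(len(tokens) - gram_size + 1):
            phrase = " ".join(tokens[start:start + gram_size])
            for uri in lex_phrase_index.get(phrase, []):
                confidence = gram_size / len(tokens)
                matches.append((uri, phrase, gram_size, confidence))
    if only_best_matches and matches:
        # keep only the matches that share the highest confidence score
        top = max(m[3] for m in matches)
        matches = [m for m in matches if m[3] == top]
    return matches


# assumed toy lexicon, for illustration only
lexicon = {"marble": ["ex:marble"], "white marble": ["ex:white_marble"]}
for match in match_phrase_to_lexicon("White marble statue", lexicon):
    print(match)
```

In this toy example the two-token match covers two of the three tokens and so scores higher than the single-token match, which is consistent with the note above that higher gram size matches are more likely to be good mappings.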
-
soton_corenlppy.re.sem_map_lib.
parse_semantic_mapping_pattern
(str_pattern=None, dict_openie_config=None) parse a serialized semantic mapping pattern into a structure suitable for efficient mapping on demand. phrase constraints must have spaces replaced with the ‘_’ token to avoid parsing ambiguity.
example patterns below
{arg1:schema=<http://collection.britishmuseum.org/id/thesauri/object>|<http://collection.britishmuseum.org/id/thesauri/subject>} -> {this rso:PX_object_type arg1}
{arg1:schema=<http://collection.britishmuseum.org/id/thesauri/object>} {rel1:phrase=carved|weaved|sculpted} {prep1:phrase=from} {arg2:schema=<http://collection.britishmuseum.org/id/thesauri/material>} -> {this crm:P45_consists_of arg2}
{arg1:schema=<http://collection.britishmuseum.org/id/thesauri/object>} {rel1:skos=<http://collection.britishmuseum.org/id/thesauri/script/carved>} {prep1:phrase=from} {arg2:schema=<http://collection.britishmuseum.org/id/thesauri/material>} -> {this crm:P45_consists_of arg2}
- Parameters
str_pattern (unicode) – serialized semantic mapping pattern
dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()
- Returns
tuple_pattern = ( list_conditions, tuple_production ). list_conditions = [ ( var_type, var_name, schema_uri[], phrase[], skos_uri[] ), … ]. tuple_production = [ ( subject, object, predicate ), … ].
- Return type
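To make the serialized pattern format concrete, here is a minimal regex-based sketch that splits a pattern into conditions and productions. It is an assumption-laden illustration, not the library's parser: the function name parse_pattern_sketch is hypothetical, the condition tuples use a simplified layout (var_type, var_name, constraint_kind, values) rather than the documented return structure, and productions are returned in the order they are written in the pattern.

```python
import re

# {arg1:schema=<uri>|<uri>} or {rel1:phrase=a|b} or {rel1:skos=<uri>}
CONDITION_RE = re.compile(r"\{([a-z]+)(\d+):(schema|phrase|skos)=([^}]*)\}")
# {this crm:P45_consists_of arg2}
PRODUCTION_RE = re.compile(r"\{([^\s{}]+)\s+([^\s{}]+)\s+([^\s{}]+)\}")


def parse_pattern_sketch(str_pattern):
    """Hypothetical helper: split a pattern into (conditions, productions)."""
    lhs, sep, rhs = str_pattern.partition("->")
    if not sep:
        raise ValueError("pattern has no '->' separator")
    conditions = []
    for var_type, index, kind, raw in CONDITION_RE.findall(lhs):
        # alternatives are separated by '|'; URIs are wrapped in <...>
        values = [value.strip("<>") for value in raw.split("|")]
        conditions.append((var_type, var_type + index, kind, values))
    # each production is kept as the three terms written in the pattern
    productions = PRODUCTION_RE.findall(rhs)
    return conditions, productions


pattern = (
    "{arg1:schema=<http://collection.britishmuseum.org/id/thesauri/object>} "
    "{rel1:phrase=carved|weaved|sculpted} {prep1:phrase=from} "
    "{arg2:schema=<http://collection.britishmuseum.org/id/thesauri/material>} "
    "-> {this crm:P45_consists_of arg2}"
)
conditions, productions = parse_pattern_sketch(pattern)
for condition in conditions:
    print(condition)
print(productions)
```

The sketch relies on the rule stated above that phrase constraints contain no spaces, which is what lets each braced group be matched without ambiguity.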