soton_corenlppy.re.dataset_support_lib module¶
Support lib for working with pretrained embedding datasets and other large NLP corpora
soton_corenlppy.re.dataset_support_lib.create_corpus(dict_propbank=None, pad_to_size=None, test_fraction=0.1, dict_openie_config=None)[source]¶
Create a BERT-style training corpus from Propbank data. e.g. sent = [CLS] … [SEP] … [SEP] [PAD] [PAD] [PAD] …
- Parameters
dict_propbank (dict) – Propbank annotations returned from read_propbank()
pad_to_size (int) – fixed length to pad each sent to
test_fraction (float) – fraction of sents held out as the test corpus
dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()
- Returns
tuple = ( list_train_corpus_words, list_train_corpus_tags, list_test_corpus_words, list_test_corpus_tags )
- Return type
tuple
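A minimal usage sketch, assuming dict_propbank comes from read_propbank() (documented below) and that get_openie_config() can be called without arguments; the dataset paths and pad size are placeholders:

    import soton_corenlppy.re.openie_lib as openie_lib
    import soton_corenlppy.re.dataset_support_lib as dsl

    # assumption: get_openie_config() works with no arguments here
    dict_config = openie_lib.get_openie_config()

    # load SRL-annotated sents (placeholder dataset paths)
    dict_propbank = dsl.read_propbank(
        propbank_dir='/data/propbank-release',
        ewt_dir='/data/eng_web_tbk',
        max_files=10,
        dict_openie_config=dict_config)

    # 90/10 train/test split, padded BERT-style to a fixed length
    (list_train_words, list_train_tags,
     list_test_words, list_test_tags) = dsl.create_corpus(
        dict_propbank=dict_propbank,
        pad_to_size=128,
        test_fraction=0.1,
        dict_openie_config=dict_config)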
soton_corenlppy.re.dataset_support_lib.generate_sequence(index_words=None, index_tags=None, index_predicates=None, padding_word_value=None, padding_tag_value=None, padding_predicate_value=None, sequence_length=None, list_word_sets=None, list_tag_sets=None, dict_openie_config=None)[source]¶
Make a set of (word, tag) sequences for each sentence.
- Parameters
index_words (dict) – index of words from generate_vocab()
index_tags (dict) – index of tags from generate_vocab()
index_predicates (dict) – index of predicates from generate_vocab()
padding_word_value (int) – value of pad word in the index
padding_tag_value (int) – value of pad tag in the index
padding_predicate_value (int) – value of pad predicate in the index
sequence_length (int) – max length of a sequence (e.g. sent length); can be None for no limit. A fixed length is needed so sents fit the fixed-size input expected by BERT embeddings
list_word_sets (list) – list of per-sent word sequences, as used with generate_vocab()
list_tag_sets (list) – list of per-sent tag sequences, as used with generate_vocab()
dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()
- Returns
tuple = ( list_seq_words, list_seq_tags_categorical, list_seq_predicates_categorical, max_seq_length )
- Return type
tuple
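A hedged sketch chaining generate_vocab() into generate_sequence(); the toy word and tag sets are illustrative only, and the tuple unpacking follows the documented return values:

    import soton_corenlppy.re.openie_lib as openie_lib
    import soton_corenlppy.re.dataset_support_lib as dsl

    dict_config = openie_lib.get_openie_config()  # assumed no-arg call

    # illustrative per-sent word and tag sequences
    list_word_sets = [['the', 'cat', 'sat'], ['dogs', 'bark']]
    list_tag_sets = [['O', 'O', 'B-V'], ['O', 'B-V']]

    # build vocabularies and indexes (see generate_vocab() below)
    (list_words_vocab, list_tags_vocab, list_predicates_vocab,
     dict_index_words, dict_index_tags, dict_index_predicates,
     index_pad_word, index_pad_tag, index_pad_predicate) = dsl.generate_vocab(
        list_word_sets=list_word_sets,
        list_tag_sets=list_tag_sets,
        dict_openie_config=dict_config)

    # fixed-length, index-encoded (word, tag, predicate) sequences
    (list_seq_words, list_seq_tags_categorical,
     list_seq_predicates_categorical, max_seq_length) = dsl.generate_sequence(
        index_words=dict_index_words,
        index_tags=dict_index_tags,
        index_predicates=dict_index_predicates,
        padding_word_value=index_pad_word,
        padding_tag_value=index_pad_tag,
        padding_predicate_value=index_pad_predicate,
        sequence_length=128,             # assumed BERT-style fixed length
        list_word_sets=list_word_sets,
        list_tag_sets=list_tag_sets,
        dict_openie_config=dict_config)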
soton_corenlppy.re.dataset_support_lib.generate_vocab(list_word_sets=None, list_tag_sets=None, dict_openie_config=None)[source]¶
Make word, tag and predicate vocabularies (and their indexes) from a set of (word, tag) sequences, using BERT format. e.g. sent = [CLS] … [SEP] … [SEP] [PAD] [PAD] [PAD] …
- Parameters
list_word_sets (list) – list of per-sent word sequences
list_tag_sets (list) – list of per-sent tag sequences
dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()
- Returns
tuple = ( list_words_vocab, list_tags_vocab, list_predicates_vocab, dict_index_words, dict_index_tags, dict_index_predicates, index_pad_word, index_pad_tag, index_pad_predicate )
- Return type
tuple
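A short sketch of inspecting the returned indexes (toy inputs; the unpacking order follows the documented return tuple):

    import soton_corenlppy.re.openie_lib as openie_lib
    import soton_corenlppy.re.dataset_support_lib as dsl

    dict_config = openie_lib.get_openie_config()  # assumed no-arg call

    (list_words_vocab, list_tags_vocab, list_predicates_vocab,
     dict_index_words, dict_index_tags, dict_index_predicates,
     index_pad_word, index_pad_tag, index_pad_predicate) = dsl.generate_vocab(
        list_word_sets=[['the', 'cat', 'sat']],   # toy word set
        list_tag_sets=[['O', 'O', 'B-V']],        # toy tag set
        dict_openie_config=dict_config)

    # map a word to its index, falling back to the pad index if unknown
    print(dict_index_words.get('cat', index_pad_word))
    print(index_pad_word, index_pad_tag, index_pad_predicate)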
soton_corenlppy.re.dataset_support_lib.read_propbank(propbank_dir=None, ewt_dir=None, max_files=None, dict_openie_config=None)[source]¶
Read in the Propbank dataset and cross-index it with the English Web Treebank dataset to provide a set of SRL-annotated sentences.
- Parameters
propbank_dir (unicode) – location of Propbank dataset dir
ewt_dir (unicode) – location of English Web Treebank dataset dir
max_files (int) – max number of files to load (None for all files); useful for testing purposes
dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()
- Returns
dict of Propbank file sent SRL annotations = { EWT_filename : { sent_index : ( [ word_token1, … ], [ pos_token1, … ], [ [ iob_token1, … ], … x N_clauses_in_sent ] ) } }
- Return type
dict
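A sketch of walking the returned structure (placeholder paths; the nesting follows the documented return value):

    import soton_corenlppy.re.openie_lib as openie_lib
    import soton_corenlppy.re.dataset_support_lib as dsl

    dict_config = openie_lib.get_openie_config()  # assumed no-arg call

    dict_propbank = dsl.read_propbank(
        propbank_dir='/data/propbank-release',   # placeholder path
        ewt_dir='/data/eng_web_tbk',             # placeholder path
        max_files=5,                             # small sample for testing
        dict_openie_config=dict_config)

    for str_filename, dict_sents in dict_propbank.items():
        for sent_index, (list_words, list_pos, list_clauses) in dict_sents.items():
            # one IOB sequence per clause in the sent
            for list_iob in list_clauses:
                print(str_filename, sent_index,
                      list(zip(list_words, list_pos, list_iob)))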
soton_corenlppy.re.dataset_support_lib.read_streusle(streusle_home=None, allowed_id_set=None, dict_config=None)[source]¶
Read in the streusle dataset. For information about the corpus see https://github.com/nert-nlp/streusle/blob/master/CONLLULEX.md
- Parameters
streusle_home (unicode) – location of the streusle dataset dir
allowed_id_set (list) – list of allowed sent IDs (None for no filter)
dict_config (dict) – config object
- Returns
dict = { sent_index : { 'sent_id' : <str>, 'text' : <str>, 'tokens' : <list>, 'phrases' : <dict>, 'phrases_addr' : <dict> } }. 'tokens' is a list of 19 columns, with multi-word expression address ranges converted from str to tuple( mwe_id, rel_position_in_mwe ). 'phrases' is a dict with key verb|prep|noun and a value of phrases. 'phrases_addr' is the same but with a list of token addresses for each phrase instead of a string.
- Return type
dict
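A hedged sketch of reading streusle and listing its phrase types; the path is a placeholder, and the config object from get_openie_config() is assumed to be reusable as dict_config:

    import soton_corenlppy.re.openie_lib as openie_lib
    import soton_corenlppy.re.dataset_support_lib as dsl

    dict_config = openie_lib.get_openie_config()  # assumed reusable here

    dict_streusle = dsl.read_streusle(
        streusle_home='/data/streusle',  # placeholder path to a streusle checkout
        allowed_id_set=None,             # None = keep every sent
        dict_config=dict_config)

    for sent_index, dict_sent in dict_streusle.items():
        print(dict_sent['sent_id'], dict_sent['text'])
        for str_phrase_type in ('verb', 'prep', 'noun'):
            print(str_phrase_type, dict_sent['phrases'].get(str_phrase_type))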
soton_corenlppy.re.dataset_support_lib.sentences_to_IOB(sentences_file=None, allowed_id_set=None, tokenize_sents=True, max_processes=1, dict_config=None)[source]¶
Read in a sentence file, then do POS tagging and generate an IOB file with each IOB tag set to the default 'O'. If the sentence file has no tabs, the sent index is assumed to be the row number; if it has tabs, the first column is assumed to be the sent index and the second column the text.
- Parameters
sentences_file (unicode) – sentences filename to read
allowed_id_set (list) – list of allowed IDs (None for no filter)
tokenize_sents (bool) – if True use Treebank to tokenize, otherwise split sent using spaces
max_processes (int) – max number of processes to use for POS tagging
dict_config (dict) – config object
- Returns
tuple = ( list of sent IDs, list of sents ), where each sent is a list of IOB-annotated token entries = [ [ ( token, pos, IOB ), … ], … ]
- Return type
tuple
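A minimal sketch, assuming a tab-separated file whose first column is the sent index and second column the text; the filename and process count are placeholders:

    import soton_corenlppy.re.openie_lib as openie_lib
    import soton_corenlppy.re.dataset_support_lib as dsl

    dict_config = openie_lib.get_openie_config()  # assumed reusable here

    list_sent_ids, list_sent_iob = dsl.sentences_to_IOB(
        sentences_file='sents.tsv',   # placeholder filename
        allowed_id_set=None,          # no ID filter
        tokenize_sents=True,          # Treebank tokenization
        max_processes=4,              # parallel POS tagging
        dict_config=dict_config)

    # every token starts with the default 'O' IOB tag
    for (str_token, str_pos, str_iob) in list_sent_iob[0]:
        print(str_token, str_pos, str_iob)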