soton_corenlppy.re.dataset_support_lib module¶
Support lib for working with pretrained embedding datasets and other large NLP corpora
soton_corenlppy.re.dataset_support_lib.create_corpus(dict_propbank=None, pad_to_size=None, test_fraction=0.1, dict_openie_config=None)[source]¶
Create a BERT-style training corpus from Propbank data. e.g. sent = [CLS] … [SEP] … [SEP] [PAD] [PAD] [PAD] …
- Parameters
dict_propbank (dict) – Propbank annotations returned from read_propbank()
pad_to_size (int) – fixed length to pad each sent to
test_fraction (float) – fraction of sents held out as the test corpus
dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()
- Returns
tuple = ( list_train_corpus_words, list_train_corpus_tags, list_test_corpus_words, list_test_corpus_tags )
- Return type
tuple
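A minimal usage sketch, assuming dict_propbank comes from read_propbank() (documented below) and that get_openie_config() can be called without arguments; the dataset paths and pad size are placeholders:

    import soton_corenlppy.re.openie_lib as openie_lib
    import soton_corenlppy.re.dataset_support_lib as dsl

    # assumption: get_openie_config() works with no arguments here
    dict_config = openie_lib.get_openie_config()

    # load SRL-annotated sents (placeholder dataset paths)
    dict_propbank = dsl.read_propbank(
        propbank_dir='/data/propbank-release',
        ewt_dir='/data/eng_web_tbk',
        max_files=10,
        dict_openie_config=dict_config)

    # 90/10 train/test split, padded BERT-style to a fixed length
    (list_train_words, list_train_tags,
     list_test_words, list_test_tags) = dsl.create_corpus(
        dict_propbank=dict_propbank,
        pad_to_size=128,
        test_fraction=0.1,
        dict_openie_config=dict_config)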
soton_corenlppy.re.dataset_support_lib.generate_sequence(index_words=None, index_tags=None, index_predicates=None, padding_word_value=None, padding_tag_value=None, padding_predicate_value=None, sequence_length=None, list_word_sets=None, list_tag_sets=None, dict_openie_config=None)[source]¶
Make a set of (word, tag) sequences for each sentence.
- Parameters
index_words (dict) – index of words from generate_vocab()
index_tags (dict) – index of tags from generate_vocab()
index_predicates (dict) – index of predicates from generate_vocab()
padding_word_value (int) – value of pad word in the index
padding_tag_value (int) – value of pad tag in the index
padding_predicate_value (int) – value of pad predicate in the index
sequence_length (int) – max length of a sequence (e.g. sent length); can be None for no limit. A fixed length is needed so sents fit the fixed-size input expected by BERT embeddings
list_word_sets (list) – list of per-sent word sequences, as used with generate_vocab()
list_tag_sets (list) – list of per-sent tag sequences, as used with generate_vocab()
dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()
- Returns
tuple = ( list_seq_words, list_seq_tags_categorical, list_seq_predicates_categorical, max_seq_length )
- Return type
tuple
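A hedged sketch chaining generate_vocab() into generate_sequence(); the toy word and tag sets are illustrative only, and the tuple unpacking follows the documented return values:

    import soton_corenlppy.re.openie_lib as openie_lib
    import soton_corenlppy.re.dataset_support_lib as dsl

    dict_config = openie_lib.get_openie_config()  # assumed no-arg call

    # illustrative per-sent word and tag sequences
    list_word_sets = [['the', 'cat', 'sat'], ['dogs', 'bark']]
    list_tag_sets = [['O', 'O', 'B-V'], ['O', 'B-V']]

    # build vocabularies and indexes (see generate_vocab() below)
    (list_words_vocab, list_tags_vocab, list_predicates_vocab,
     dict_index_words, dict_index_tags, dict_index_predicates,
     index_pad_word, index_pad_tag, index_pad_predicate) = dsl.generate_vocab(
        list_word_sets=list_word_sets,
        list_tag_sets=list_tag_sets,
        dict_openie_config=dict_config)

    # fixed-length, index-encoded (word, tag, predicate) sequences
    (list_seq_words, list_seq_tags_categorical,
     list_seq_predicates_categorical, max_seq_length) = dsl.generate_sequence(
        index_words=dict_index_words,
        index_tags=dict_index_tags,
        index_predicates=dict_index_predicates,
        padding_word_value=index_pad_word,
        padding_tag_value=index_pad_tag,
        padding_predicate_value=index_pad_predicate,
        sequence_length=128,             # assumed BERT-style fixed length
        list_word_sets=list_word_sets,
        list_tag_sets=list_tag_sets,
        dict_openie_config=dict_config)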
soton_corenlppy.re.dataset_support_lib.generate_vocab(list_word_sets=None, list_tag_sets=None, dict_openie_config=None)[source]¶
Make word, tag and predicate vocabularies (and their indexes) from a set of (word, tag) sequences, using BERT format. e.g. sent = [CLS] … [SEP] … [SEP] [PAD] [PAD] [PAD] …
- Parameters
list_word_sets (list) – list of per-sent word sequences
list_tag_sets (list) – list of per-sent tag sequences
dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()
- Returns
tuple = ( list_words_vocab, list_tags_vocab, list_predicates_vocab, dict_index_words, dict_index_tags, dict_index_predicates, index_pad_word, index_pad_tag, index_pad_predicate )
- Return type
tuple
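A short sketch of inspecting the returned indexes (toy inputs; the unpacking order follows the documented return tuple):

    import soton_corenlppy.re.openie_lib as openie_lib
    import soton_corenlppy.re.dataset_support_lib as dsl

    dict_config = openie_lib.get_openie_config()  # assumed no-arg call

    (list_words_vocab, list_tags_vocab, list_predicates_vocab,
     dict_index_words, dict_index_tags, dict_index_predicates,
     index_pad_word, index_pad_tag, index_pad_predicate) = dsl.generate_vocab(
        list_word_sets=[['the', 'cat', 'sat']],   # toy word set
        list_tag_sets=[['O', 'O', 'B-V']],        # toy tag set
        dict_openie_config=dict_config)

    # map a word to its index, falling back to the pad index if unknown
    print(dict_index_words.get('cat', index_pad_word))
    print(index_pad_word, index_pad_tag, index_pad_predicate)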
soton_corenlppy.re.dataset_support_lib.read_propbank(propbank_dir=None, ewt_dir=None, max_files=None, dict_openie_config=None)[source]¶
Read in the Propbank dataset and cross-index it with the English Web Treebank dataset to provide a set of SRL-annotated sentences.
- Parameters
propbank_dir (unicode) – location of Propbank dataset dir
ewt_dir (unicode) – location of English Web Treebank dataset dir
max_files (int) – max number of files to load (None for all files); useful for testing purposes
dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()
- Returns
dict of Propbank file sent SRL annotations = { EWT_filename : { sent_index : ( [ word_token1, … ], [ pos_token1, … ], [ [ iob_token1, … ], … x N_clauses_in_sent ] ) } }
- Return type
dict
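A sketch of walking the returned structure (placeholder paths; the nesting follows the documented return value):

    import soton_corenlppy.re.openie_lib as openie_lib
    import soton_corenlppy.re.dataset_support_lib as dsl

    dict_config = openie_lib.get_openie_config()  # assumed no-arg call

    dict_propbank = dsl.read_propbank(
        propbank_dir='/data/propbank-release',   # placeholder path
        ewt_dir='/data/eng_web_tbk',             # placeholder path
        max_files=5,                             # small sample for testing
        dict_openie_config=dict_config)

    for str_filename, dict_sents in dict_propbank.items():
        for sent_index, (list_words, list_pos, list_clauses) in dict_sents.items():
            # one IOB sequence per clause in the sent
            for list_iob in list_clauses:
                print(str_filename, sent_index,
                      list(zip(list_words, list_pos, list_iob)))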
soton_corenlppy.re.dataset_support_lib.read_streusle(streusle_home=None, allowed_id_set=None, dict_config=None)[source]¶
Read in the streusle dataset. For information about the corpus see https://github.com/nert-nlp/streusle/blob/master/CONLLULEX.md
- Parameters
streusle_home (unicode) – location of the streusle dataset dir
allowed_id_set (list) – list of allowed sent IDs (None for no filter)
dict_config (dict) – config object
- Returns
dict = { sent_index : { 'sent_id' : <str>, 'text' : <str>, 'tokens' : <list>, 'phrases' : <dict>, 'phrases_addr' : <dict> } }. 'tokens' is a list of 19 columns, with multi-word expression address ranges converted from str to tuple( mwe_id, rel_position_in_mwe ). 'phrases' is a dict with key verb|prep|noun and a value of phrases. 'phrases_addr' is the same but with a list of token addresses for each phrase instead of a string.
- Return type
dict
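A hedged sketch of reading streusle and listing its phrase types; the path is a placeholder, and the config object from get_openie_config() is assumed to be reusable as dict_config:

    import soton_corenlppy.re.openie_lib as openie_lib
    import soton_corenlppy.re.dataset_support_lib as dsl

    dict_config = openie_lib.get_openie_config()  # assumed reusable here

    dict_streusle = dsl.read_streusle(
        streusle_home='/data/streusle',  # placeholder path to a streusle checkout
        allowed_id_set=None,             # None = keep every sent
        dict_config=dict_config)

    for sent_index, dict_sent in dict_streusle.items():
        print(dict_sent['sent_id'], dict_sent['text'])
        for str_phrase_type in ('verb', 'prep', 'noun'):
            print(str_phrase_type, dict_sent['phrases'].get(str_phrase_type))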
soton_corenlppy.re.dataset_support_lib.sentences_to_IOB(sentences_file=None, allowed_id_set=None, tokenize_sents=True, max_processes=1, dict_config=None)[source]¶
Read in a sentence file, then do POS tagging and generate an IOB file with each IOB tag set to the default 'O'. If the sentence file has no tabs, the sent index is assumed to be the row number; if it has tabs, the first column is assumed to be the sent index and the second column the text.
- Parameters
sentences_file (unicode) – sentences filename to read
allowed_id_set (list) – list of allowed IDs (None for no filter)
tokenize_sents (bool) – if True use Treebank to tokenize, otherwise split sent using spaces
max_processes (int) – max number of processes to use for POS tagging
dict_config (dict) – config object
- Returns
tuple = ( list of sent IDs, list of sents ), where each sent is a list of IOB-annotated token entries = [ [ ( token, pos, IOB ), … ], … ]
- Return type
tuple
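A minimal sketch, assuming a tab-separated file whose first column is the sent index and second column the text; the filename and process count are placeholders:

    import soton_corenlppy.re.openie_lib as openie_lib
    import soton_corenlppy.re.dataset_support_lib as dsl

    dict_config = openie_lib.get_openie_config()  # assumed reusable here

    list_sent_ids, list_sent_iob = dsl.sentences_to_IOB(
        sentences_file='sents.tsv',   # placeholder filename
        allowed_id_set=None,          # no ID filter
        tokenize_sents=True,          # Treebank tokenization
        max_processes=4,              # parallel POS tagging
        dict_config=dict_config)

    # every token starts with the default 'O' IOB tag
    for (str_token, str_pos, str_iob) in list_sent_iob[0]:
        print(str_token, str_pos, str_iob)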