soton_corenlppy.re.dataset_support_lib module

Support lib for working with pretrained embedding datasets and other large NLP corpora

soton_corenlppy.re.dataset_support_lib.create_corpus(dict_propbank=None, pad_to_size=None, test_fraction=0.1, dict_openie_config=None)[source]

create a BERT-style training corpus from Propbank data, e.g. sent = [CLS] … [SEP] … [SEP] [PAD] [PAD] [PAD] …

Parameters
  • dict_propbank (dict) – Propbank data from read_propbank()

  • pad_to_size (int) – size to pad sequences to (can be None)

  • test_fraction (float) – fraction of corpus to use as test data

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

tuple = ( list_train_corpus_words, list_train_corpus_tags, list_test_corpus_words, list_test_corpus_tags )

Return type

tuple
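
a minimal usage sketch is shown below; it assumes get_openie_config() can be called with default arguments, and the dataset dirs are placeholder paths. see read_propbank() below for the loading step.

    import soton_corenlppy.re.openie_lib
    import soton_corenlppy.re.dataset_support_lib

    # assumption - get_openie_config() works with defaults in your setup
    dict_config = soton_corenlppy.re.openie_lib.get_openie_config()

    # placeholder dataset dirs - replace with your local corpus locations
    dict_propbank = soton_corenlppy.re.dataset_support_lib.read_propbank(
        propbank_dir='corpora/propbank',
        ewt_dir='corpora/ewt',
        max_files=10,
        dict_openie_config=dict_config )

    # pad each sent to 128 tokens and hold out 10% of the corpus for testing
    ( list_train_words, list_train_tags, list_test_words, list_test_tags ) = \
        soton_corenlppy.re.dataset_support_lib.create_corpus(
            dict_propbank=dict_propbank,
            pad_to_size=128,
            test_fraction=0.1,
            dict_openie_config=dict_config )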

soton_corenlppy.re.dataset_support_lib.generate_sequence(index_words=None, index_tags=None, index_predicates=None, padding_word_value=None, padding_tag_value=None, padding_predicate_value=None, sequence_length=None, list_word_sets=None, list_tag_sets=None, dict_openie_config=None)[source]

make a set of indexed (word, tag, predicate) sequences for each sentence

Parameters
  • index_words (dict) – index of words from generate_vocab()

  • index_tags (dict) – index of tags from generate_vocab()

  • index_predicates (dict) – index of predicates from generate_vocab()

  • padding_word_value (int) – value of pad word in the index

  • padding_tag_value (int) – value of pad tag in the index

  • padding_predicate_value (int) – value of pad predicate in the index

  • sequence_length (int) – max length of a sequence (e.g. sentence length); can be None for no limit. a fixed length is needed when sents are used as embedding inputs with BERT

  • list_word_sets (list) – list of vocab words from generate_vocab()

  • list_tag_sets (list) – list of vocab tags from generate_vocab()

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

tuple = ( list_seq_words, list_seq_tags_categorical, list_seq_predicates_categorical, max_seq_length )

Return type

tuple
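
a sketch of the intended pairing with generate_vocab() (documented below), whose index and pad outputs feed directly into generate_sequence(). the word and tag lists here are toy data and the SRL tag values are illustrative only.

    import soton_corenlppy.re.openie_lib
    import soton_corenlppy.re.dataset_support_lib

    dict_config = soton_corenlppy.re.openie_lib.get_openie_config()

    # toy corpus - two sents with illustrative IOB tags
    list_word_sets = [ [ 'the', 'cat', 'sat' ], [ 'dogs', 'bark' ] ]
    list_tag_sets = [ [ 'O', 'B-ARG0', 'B-V' ], [ 'B-ARG0', 'B-V' ] ]

    ( list_words_vocab, list_tags_vocab, list_predicates_vocab,
      dict_index_words, dict_index_tags, dict_index_predicates,
      index_pad_word, index_pad_tag, index_pad_predicate ) = \
        soton_corenlppy.re.dataset_support_lib.generate_vocab(
            list_word_sets=list_word_sets,
            list_tag_sets=list_tag_sets,
            dict_openie_config=dict_config )

    # cap sequences at 64 tokens so they fit a fixed-size BERT input
    ( list_seq_words, list_seq_tags_categorical,
      list_seq_predicates_categorical, max_seq_length ) = \
        soton_corenlppy.re.dataset_support_lib.generate_sequence(
            index_words=dict_index_words,
            index_tags=dict_index_tags,
            index_predicates=dict_index_predicates,
            padding_word_value=index_pad_word,
            padding_tag_value=index_pad_tag,
            padding_predicate_value=index_pad_predicate,
            sequence_length=64,
            list_word_sets=list_word_sets,
            list_tag_sets=list_tag_sets,
            dict_openie_config=dict_config )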

soton_corenlppy.re.dataset_support_lib.generate_vocab(list_word_sets=None, list_tag_sets=None, dict_openie_config=None)[source]

generate word, tag and predicate vocabularies, plus index lookups and pad indexes, from a set of sents; the vocab includes the special BERT tokens e.g. [CLS] [SEP] [PAD]

Parameters
  • list_word_sets (list) – list of words for each sent

  • list_tag_sets (list) – list of tags for each sent

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

tuple = ( list_words_vocab, list_tags_vocab, list_predicates_vocab, dict_index_words, dict_index_tags, dict_index_predicates, index_pad_word, index_pad_tag, index_pad_predicate )

Return type

tuple
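
a short sketch with toy input (tag values illustrative); the returned indexes and pad values are exactly what generate_sequence() above expects.

    import soton_corenlppy.re.openie_lib
    import soton_corenlppy.re.dataset_support_lib

    dict_config = soton_corenlppy.re.openie_lib.get_openie_config()

    ( list_words_vocab, list_tags_vocab, list_predicates_vocab,
      dict_index_words, dict_index_tags, dict_index_predicates,
      index_pad_word, index_pad_tag, index_pad_predicate ) = \
        soton_corenlppy.re.dataset_support_lib.generate_vocab(
            list_word_sets=[ [ 'the', 'cat', 'sat' ] ],
            list_tag_sets=[ [ 'O', 'B-ARG0', 'B-V' ] ],
            dict_openie_config=dict_config )

    # dict_index_words maps each vocab entry to an integer index;
    # index_pad_word is the index reserved for the [PAD] token
    print( index_pad_word, dict_index_words )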

soton_corenlppy.re.dataset_support_lib.read_propbank(propbank_dir=None, ewt_dir=None, max_files=None, dict_openie_config=None)[source]

read in the Propbank dataset and cross-index it with the English Web Treebank dataset to provide a set of SRL annotated sentences.

Parameters
  • propbank_dir (unicode) – location of Propbank dataset dir

  • ewt_dir (unicode) – location of English Web Treebank dataset dir

  • max_files (int) – max number of files to load (None for all files); useful for testing

  • dict_openie_config (dict) – config object returned from soton_corenlppy.re.openie_lib.get_openie_config()

Returns

dict of Propbank file sent SRL annotations = { EWT_filename : { sent_index : ( [ word_token1, … ], [ pos_token1, … ], [ [ iob_token1, … ], … x N_clauses_in_sent ] ) } }

Return type

dict
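
a usage sketch; the dataset dirs are hypothetical placeholders, and the loop simply unpacks the documented return structure.

    import soton_corenlppy.re.openie_lib
    import soton_corenlppy.re.dataset_support_lib

    dict_config = soton_corenlppy.re.openie_lib.get_openie_config()

    # max_files=5 keeps a test run fast
    dict_propbank = soton_corenlppy.re.dataset_support_lib.read_propbank(
        propbank_dir='corpora/propbank',
        ewt_dir='corpora/eng_web_tbk',
        max_files=5,
        dict_openie_config=dict_config )

    for str_ewt_file in dict_propbank:
        for n_sent_index in dict_propbank[ str_ewt_file ]:
            ( list_words, list_pos, list_iob_clauses ) = dict_propbank[ str_ewt_file ][ n_sent_index ]
            # list_iob_clauses holds one IOB tag list per clause in the sent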

soton_corenlppy.re.dataset_support_lib.read_streusle(streusle_home=None, allowed_id_set=None, dict_config=None)[source]

read in the STREUSLE dataset. for information about the corpus see https://github.com/nert-nlp/streusle/blob/master/CONLLULEX.md

Parameters
  • streusle_home (unicode) – dir of STREUSLE dataset

  • allowed_id_set (list) – list of allowed IDs (None for no filter)

  • dict_config (dict) – config object

Returns

dict = { sent_index : { 'sent_id' : <str>, 'text' : <str>, 'tokens' : <list>, 'phrases' : <dict>, 'phrases_addr' : <dict> } }. 'tokens' is a list of 19 columns, with multiword expression address ranges converted from str to tuple( mwe_id, rel_position_in_mwe ). 'phrases' is a dict with key verb|prep|noun and a value of phrases. 'phrases_addr' is the same but with a list of token addresses for each phrase instead of a string.

Return type

dict
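
a usage sketch (streusle_home is a placeholder path; dict_config is assumed here to come from get_openie_config(), as with the other functions in this module):

    import soton_corenlppy.re.openie_lib
    import soton_corenlppy.re.dataset_support_lib

    dict_config = soton_corenlppy.re.openie_lib.get_openie_config()

    dict_sent_set = soton_corenlppy.re.dataset_support_lib.read_streusle(
        streusle_home='corpora/streusle',
        allowed_id_set=None,
        dict_config=dict_config )

    for n_sent_index in dict_sent_set:
        dict_sent = dict_sent_set[ n_sent_index ]
        # 'phrases_addr' maps verb|prep|noun to token address lists per phrase
        print( dict_sent['sent_id'], dict_sent['text'] )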

soton_corenlppy.re.dataset_support_lib.sentences_to_IOB(sentences_file=None, allowed_id_set=None, tokenize_sents=True, max_processes=1, dict_config=None)[source]

read in a sentence file, then do POS tagging and generate an IOB file with the IOB tag set to a default of 'O'. if the sentence file has no tabs, the sent index is assumed to be the row number. if it has tabs, the first column is assumed to be the sent index and the second column the text.

Parameters
  • sentences_file (unicode) – sentences filename to read

  • allowed_id_set (list) – list of allowed IDs (None for no filter)

  • tokenize_sents (bool) – if True tokenize sents using a Treebank tokenizer, otherwise split on spaces

  • max_processes (int) – max number of processes to use for POS tagging

  • dict_config (dict) – config object

Returns

list of sent IDs; list of sents, each a list of IOB annotated token entries = [ [ ( token, pos, IOB ), … ], … ]

Return type

list, list
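
a usage sketch; the tab-delimited file written below follows the 'sent index <tab> text' layout described above.

    import soton_corenlppy.re.openie_lib
    import soton_corenlppy.re.dataset_support_lib

    dict_config = soton_corenlppy.re.openie_lib.get_openie_config()

    # make a small tab-delimited sentence file: <sent_index>\t<text>
    with open( 'sents.txt', 'w' ) as write_handle:
        write_handle.write( '0\tThe cat sat on the mat.\n' )
        write_handle.write( '1\tDogs bark loudly.\n' )

    ( list_sent_ids, list_sent_iob ) = \
        soton_corenlppy.re.dataset_support_lib.sentences_to_IOB(
            sentences_file='sents.txt',
            allowed_id_set=None,
            tokenize_sents=True,
            max_processes=4,
            dict_config=dict_config )

    # each sent is a list of ( token, pos, IOB ) tuples, IOB defaulting to 'O'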

soton_corenlppy.re.dataset_support_lib.streusle_to_IOB(dict_sent_set=None, max_processes=1, dict_config=None)[source]

compute a set of IOB tags for [noun, verb, prep] from a sent set returned by read_streusle()

Parameters
  • dict_sent_set (dict) – dict returned by read_streusle()

  • max_processes (int) – max number of processes to use for POS tagging

  • dict_config (dict) – config object

Returns

list of sents, each a list of IOB annotated token entries = [ [ ( token, pos, IOB ), … ], … ]

Return type

list
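
a sketch chaining read_streusle() into streusle_to_IOB() (the streusle_home path is a placeholder):

    import soton_corenlppy.re.openie_lib
    import soton_corenlppy.re.dataset_support_lib

    dict_config = soton_corenlppy.re.openie_lib.get_openie_config()

    dict_sent_set = soton_corenlppy.re.dataset_support_lib.read_streusle(
        streusle_home='corpora/streusle',
        allowed_id_set=None,
        dict_config=dict_config )

    list_sent_iob = soton_corenlppy.re.dataset_support_lib.streusle_to_IOB(
        dict_sent_set=dict_sent_set,
        max_processes=4,
        dict_config=dict_config )

    # each entry is a sent as a list of ( token, pos, IOB ) tuples,
    # with IOB tags covering noun, verb and prep phrases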