soton_corenlppy.re.shallow_parse_lib module

(c) Copyright University of Southampton IT Innovation, 2019

Copyright in this software belongs to IT Innovation Centre of Gamma House, Enterprise Road, Southampton SO16 7NS, UK.

This software may not be used, sold, licensed, transferred, copied or reproduced in whole or in part in any manner or form or in or on any media by any person other than in accordance with the terms of the Licence Agreement supplied with the software, or otherwise without the prior written consent of the copyright owners.

This software is distributed WITHOUT ANY WARRANTY, without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE, except where stated in the Licence Agreement supplied with the software.

Created By : Stuart E. Middleton
Created Date : 2019/05/30
Created for Project: LPLP

Dependencies: None

soton_corenlppy.re.shallow_parse_lib.eval_chunked_sents(dict_labelled_sents=None, dict_openie_config=None)[source]

score a set of labelled sents created using label_sents_from_chunked()

Parameters
  • dict_labelled_sents (dict) – sent labels created using label_sents_from_chunked()

  • dict_openie_config (dict) – config object

Returns

dict of scores = { ‘macro-averaged’ : …, ‘micro-averaged’ : … }

Return type

dict
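The difference between the two averages in the returned dict can be illustrated with a small token-level sketch. This is not the library's scoring code: the helper names `f1` and `macro_micro_f1` are hypothetical, scoring per token (rather than per chunk) is an assumption, and the real values stored under each key may be richer than a single float.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return (2 * tp / denom) if denom else 0.0

def macro_micro_f1(gold, pred):
    """Token-level macro- and micro-averaged F1 over every label except 'O'."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g != 'O':
                tp[g] += 1
        else:
            if p != 'O':
                fp[p] += 1   # predicted label was wrong
            if g != 'O':
                fn[g] += 1   # gold label was missed
    labels = sorted(set(tp) | set(fp) | set(fn))
    # macro: average the per-label F1 scores, so each label counts equally
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels) if labels else 0.0
    # micro: pool the counts first, so each token counts equally
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return {'macro-averaged': macro, 'micro-averaged': micro}

gold = ['B-NP', 'I-NP', 'O', 'B-VP', 'B-NP']
pred = ['B-NP', 'I-NP', 'O', 'B-NP', 'B-NP']
scores = macro_micro_f1(gold, pred)
```

Macro-averaging penalises the missed rare label (B-VP) more heavily than micro-averaging does, which is why the two figures are usually reported together.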

soton_corenlppy.re.shallow_parse_lib.eval_shallow_parse(file_to_score=None, dict_scores=None, gold_index=2, dict_openie_config=None)[source]

load chunked file and score it

Parameters
  • file_to_score (str) – file to load and score

  • dict_scores (dict) – dict where scores will be recorded

  • gold_index (int) – index in IOB tuple with the gold truth label (IOB prediction is always at end of tuple)

  • dict_openie_config (dict) – config object

soton_corenlppy.re.shallow_parse_lib.label_sents_from_chunked(list_IOB_corpus=None, chunked_index=3, gold_index=2, dict_openie_config=None)[source]

process a chunked IOB corpus and return a JSON structure with labelled sentences

BILOU:

B - ‘beginning’
I - ‘inside’
L - ‘last’
O - ‘outside’
U - ‘unit’ (single occurrence)

Parameters
  • list_IOB_corpus (list) – corpus = [ [ (token,pos,…,iob_tag), … ], … ]

  • chunked_index (int) – index of predicted IOB label. can be None.

  • gold_index (int) – index of gold IOB label. can be None.

  • dict_openie_config (dict) – config object

Returns

dict of sent labels = { sent_index : { ‘sent’ : ‘…’, ‘pos’ : ‘…’, ‘chunk-labels’ : […], ‘gold-labels’ : […] } }

Return type

dict
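As an illustration of the BILOU scheme listed above, the sketch below converts one sentence of IOB tags into BILOU tags. It is a generic conversion, not code taken from this module: the function name `iob_to_bilou` is hypothetical, and whether the library stores BILOU or IOB tags under ‘chunk-labels’ and ‘gold-labels’ is not stated here.

```python
def iob_to_bilou(tags):
    """Convert one sentence of IOB tags (B-X, I-X, O) to BILOU tags."""
    bilou = []
    for idx, tag in enumerate(tags):
        if tag == 'O':
            bilou.append('O')
            continue
        prefix, _, label = tag.partition('-')
        nxt = tags[idx + 1] if idx + 1 < len(tags) else 'O'
        # the chunk continues only if the next tag is I- with the same label
        continues = (nxt == 'I-' + label)
        if prefix == 'B':
            bilou.append(('B-' if continues else 'U-') + label)
        else:
            bilou.append(('I-' if continues else 'L-') + label)
    return bilou

# one sentence in the corpus shape [ (token, pos, iob_tag), ... ]
sent = [('The', 'DT', 'B-NP'), ('old', 'JJ', 'I-NP'), ('dog', 'NN', 'I-NP'),
        ('barked', 'VBD', 'B-VP'), ('.', '.', 'O')]
bilou_tags = iob_to_bilou([tok[-1] for tok in sent])
```

Single-token chunks become U- and the final token of a longer chunk becomes L-, so chunk boundaries can be read from a tag without looking at its neighbours.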

soton_corenlppy.re.shallow_parse_lib.label_sents_from_chunked_file(chunk_file=None, chunked_index=3, gold_index=2, dict_openie_config=None)[source]

load an IOB chunked file and return a JSON structure with labelled sentences

BILOU:

B - ‘beginning’
I - ‘inside’
L - ‘last’
O - ‘outside’
U - ‘unit’ (single occurrence)

Parameters
  • chunk_file (str) – chunk file to load

  • chunked_index (int) – index in IOB tuple with the chunked label. can be None.

  • gold_index (int) – index in IOB tuple with the gold truth label (IOB prediction is always at end of tuple). can be None.

  • dict_openie_config (dict) – config object

Returns

dict of sent labels = { sent_index : { ‘sent’ : ‘…’, ‘pos’ : ‘…’, ‘chunk-labels’ : […], ‘gold-labels’ : […] } }

Return type

dict

soton_corenlppy.re.shallow_parse_lib.shallow_parse_crf(list_IOB_test_corpus=None, crf_model=None, word2features=<function word2features>, word2featuresConfig=None, log_eval=True, dict_openie_config=None)[source]

run a trained scikit learn CRF model on a test IOB corpus. for scikit learn CRF the IOB corpus is a list of sentences, and each sentence is a list of tokens (phrase, POS, IOB tag). if this is unlabelled test data the IOB tag should be ‘O’

Parameters
  • list_IOB_test_corpus (list) – test corpus = [ [ (token,pos,iob_tag), … ], … ]

  • crf_model (sklearn_crfsuite.CRF) – CRF model from train_shallow_parse_crf()

  • word2features (function) – function pointer to word2features(sent, i, dict_config) which will be called to generate a feature dict for each word in an IOB sentence.

  • word2featuresConfig (dict) – dict of config for word2features(), can be None if not needed.

  • log_eval (bool) – if True eval data will be logged (assuming IOB test corpus has gold tags provided). If False then returned macro_F1 + macro_scores are None

  • dict_openie_config (dict) – config object returned from openiepy.openie_lib.get_openie_config()

Returns

list of labelled sentences, each a list of the predicted IOB labels (one for each token in test IOB corpus) e.g. [ [‘B-LOC’, ‘I-LOC’, ‘O’, … ], …] + macro_F1 + macro_scores

Return type

list, float, dict
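The predicted labels come back as a separate list of lists, whereas label_sents_from_chunked() reads both labels from the token tuples. A minimal sketch of joining the two is shown below. The helper `append_predictions` is hypothetical, and the resulting tuple layout (token, pos, gold, predicted) is an assumption inferred from the documented defaults gold_index=2 and chunked_index=3.

```python
def append_predictions(list_iob_corpus, list_predicted):
    """Return a new corpus with each predicted label appended to its token tuple."""
    merged = []
    for sent, labels in zip(list_iob_corpus, list_predicted):
        if len(sent) != len(labels):
            raise ValueError('one predicted label is needed per token')
        merged.append([tuple(tok) + (lab,) for tok, lab in zip(sent, labels)])
    return merged

# test corpus with gold tags, plus labels shaped like the first return value
corpus = [[('London', 'NNP', 'B-LOC'), ('is', 'VBZ', 'O')]]
predicted = [['B-LOC', 'O']]
merged = append_predictions(corpus, predicted)
```

The original corpus is left unchanged, so the same test data can be reused with a different model.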

soton_corenlppy.re.shallow_parse_lib.shallow_parse_crf_plus_plus(file_IOB_test_corpus=None, list_IOB_test_corpus=None, test_filename=None, model_file=None, template=None, dict_openie_config=None)[source]

run a trained CRF++ chunker on a set of test JSON files

Parameters
  • file_IOB_test_corpus (str) – test corpus serialized as IOB text file (if None will use list_IOB_test_corpus)

  • list_IOB_test_corpus (list) – test corpus = [ (token,pos,iob_tag), …, (‘’,), (token,pos,iob_tag), … ]. (if None will use file_IOB_test_corpus)

  • test_filename (str) – base filename for output files = test IOB (suffix .iob) and classified IOB (suffix .iob.chunked)

  • model_file (str) – filename of the model trained using train_shallow_parse_crf_plus_plus()

  • template (str) – filename of CRF template

  • dict_openie_config (dict) – config object returned from openiepy.openie_lib.get_openie_config()

Returns

name of chunked file created

Return type

str

soton_corenlppy.re.shallow_parse_lib.train_shallow_parse_crf(list_IOB_training_corpus=None, word2features=<function word2features>, word2featuresConfig=None, n_jobs=-1, log_eval=True, params_space={'c1': [0, 0.1, 1.0], 'c2': [0, 0.1, 1.0]}, num_folds=3, all_possible_transitions=True, dict_openie_config=None)[source]

train a scikit learn CRF model using an IOB training corpus. for scikit learn CRF the IOB corpus is a list of sentences, and each sentence is a list of tokens (phrase, POS, … other features, IOB tag). the trained CRF model is returned as an object to allow efficient in-memory running (unlike the CRF++ version of this function, which serializes the IOB training corpus and runs an EXE to classify). model params are optimised using sklearn.model_selection.GridSearchCV

Parameters
  • list_IOB_training_corpus (list) – training corpus = [ [ (token,pos,…,iob_tag), … ], … ]

  • word2features (function) – function pointer to word2features(sent, i, dict_config) which will be called to generate a feature dict for each word in an IOB sentence.

  • word2featuresConfig (dict) – dict of config for word2features(), can be None if not needed.

  • n_jobs (int) – number of jobs to spawn for the param grid search (optimising CRF model training) (-1 uses all available processors)

  • log_eval (bool) – if True eval data will be logged. if False the returned macro_F1 is None

  • all_possible_transitions (bool) – if True negative transitions will be considered, generating L**2 transitions for L labels. will improve accuracy but be very slow to compute for larger datasets.

  • params_space (dict) – param space for GridSearchCV (default gives a basic search with 9 combinations of c1 and c2 params in total)

  • num_folds (int) – number of folds for GridSearchCV

  • dict_openie_config (dict) – config object returned from openiepy.openie_lib.get_openie_config()

Returns

trained CRF model object, macro_F1

Return type

sklearn_crfsuite.CRF, float
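To show where the "9 combinations" figure for the default params_space comes from, the sketch below enumerates the grid with the standard library. It only mirrors the exhaustive enumeration that a grid search performs; the variable names are illustrative and no part of this module is called.

```python
from itertools import product

# default values taken from the function signature
params_space = {'c1': [0, 0.1, 1.0], 'c2': [0, 0.1, 1.0]}
num_folds = 3

names = sorted(params_space)
# every combination of one c1 value with one c2 value
grid = [dict(zip(names, values))
        for values in product(*(params_space[n] for n in names))]

# each candidate is fitted once per cross-validation fold
total_cv_fits = len(grid) * num_folds
```

Adding a fourth value to either list raises the grid from 9 to 12 candidates, so training time grows with the product of the list lengths rather than their sum.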

soton_corenlppy.re.shallow_parse_lib.train_shallow_parse_crf_plus_plus(file_IOB_training_corpus=None, list_IOB_training_corpus=None, training_filename=None, template=None, dict_openie_config=None)[source]

create a training IOB file from a set of JSON documents and run a chunker (CRF++) to create a model file. for CRF++ the IOB training corpus is a token list, with a (‘’,) tuple terminating a sentence

Parameters
  • file_IOB_training_corpus (str) – training corpus serialized as IOB text file (if None will use list_IOB_training_corpus)

  • list_IOB_training_corpus (list) – training corpus = [ (token,pos,iob_tag), …, (‘’,), (token,pos,iob_tag), … ]. (if None will use file_IOB_training_corpus)

  • training_filename (str) – base filename for output files = training IOB (suffix .iob) and training model files (suffix .model)

  • template (str) – filename of CRF template

  • dict_openie_config (dict) – config object returned from openiepy.openie_lib.get_openie_config()

Returns

name of model file created

Return type

str
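The flat CRF++ corpus differs from the nested scikit learn corpus only in how sentences are delimited. The sketch below converts a nested corpus into the flat form with (‘’,) terminators and then into CRF++ style text (one token per line, blank line between sentences). The helper names `flatten_corpus` and `serialize_crfpp` are hypothetical, and the tab separator is an assumption; the exact text this module writes to the .iob file is not documented here.

```python
def flatten_corpus(list_sents):
    """Nested corpus [ [ (token,pos,iob_tag), ... ], ... ] -> flat list with ('',) terminators."""
    flat = []
    for sent in list_sents:
        flat.extend(sent)
        flat.append(('',))   # sentence terminator
    return flat

def serialize_crfpp(list_tokens, sep='\t'):
    """Flat token list -> CRF++ style text; the separator is an assumed choice."""
    lines = []
    for tok in list_tokens:
        # a terminator tuple becomes the blank line that separates sentences
        lines.append('' if tok == ('',) else sep.join(tok))
    return '\n'.join(lines) + '\n'

sents = [[('London', 'NNP', 'B-LOC'), ('calling', 'VBG', 'O')],
         [('Hello', 'UH', 'O')]]
flat = flatten_corpus(sents)
text = serialize_crfpp(flat)
```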

soton_corenlppy.re.shallow_parse_lib.word2features(sent, i, dict_config=None)[source]

default internal word2features function. called by train_shallow_parse_crf()

Parameters
  • sent (list) – sent = [ (token,pos,…,iob_tag), … ]

  • i (int) – index of token in sent to generate IOB features for

  • dict_config (dict) – static config for feature generation (can be None)

Returns

crf feature dict for this token in this sent

Return type

dict
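A replacement feature function can be passed to the training and parsing functions through their word2features argument, provided it keeps the same (sent, i, dict_config) signature. The sketch below is one possible custom function, not a copy of the default: the name `my_word2features`, the feature names, and the ‘window’ config key are all hypothetical.

```python
def my_word2features(sent, i, dict_config=None):
    """Build a CRF feature dict for token i of sent = [ (token, pos, ..., iob_tag), ... ]."""
    config = dict_config or {}
    window = config.get('window', 1)   # hypothetical key: context size in tokens
    token, pos = sent[i][0], sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower': token.lower(),
        'word.istitle': token.istitle(),
        'pos': pos,
    }
    for offset in range(1, window + 1):
        # POS tags of the surrounding tokens, or sentence boundary markers
        if i - offset >= 0:
            features['-%d:pos' % offset] = sent[i - offset][1]
        else:
            features['BOS'] = True
        if i + offset < len(sent):
            features['+%d:pos' % offset] = sent[i + offset][1]
        else:
            features['EOS'] = True
    return features

sent = [('London', 'NNP', 'B-LOC'), ('is', 'VBZ', 'O')]
first = my_word2features(sent, 0)
last = my_word2features(sent, 1, {'window': 1})
```

The IOB tag in the last tuple position is deliberately never read, since it is the value the model is meant to predict.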