soton_corenlppy.re.shallow_parse_lib module¶
// (c) Copyright University of Southampton IT Innovation, 2019
//
// Copyright in this software belongs to IT Innovation Centre of
// Gamma House, Enterprise Road, Southampton SO16 7NS, UK.
//
// This software may not be used, sold, licensed, transferred, copied
// or reproduced in whole or in part in any manner or form or in or
// on any media by any person other than in accordance with the terms
// of the Licence Agreement supplied with the software, or otherwise
// without the prior written consent of the copyright owners.
//
// This software is distributed WITHOUT ANY WARRANTY, without even the
// implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
// PURPOSE, except where stated in the Licence Agreement supplied with
// the software.
//
// Created By : Stuart E. Middleton
// Created Date : 2019/05/30
// Created for Project: LPLP
//
/////////////////////////////////////////////////////////////////////////
//
// Dependencies: None
//
/////////////////////////////////////////////////////////////////////////
-
soton_corenlppy.re.shallow_parse_lib.eval_chunked_sents(dict_labelled_sents=None, dict_openie_config=None)[source]¶
score a set of labelled sents created using label_sents_from_chunked()
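A minimal usage sketch, assuming dict_config is a config object from openiepy.openie_lib.get_openie_config() and the labelled sentences come from label_sents_from_chunked(); the variable names are illustrative only:

# hedged sketch - score chunk labels against gold labels for a set of labelled sentences
from soton_corenlppy.re import shallow_parse_lib

dict_config = ...       # assumed: config object from openiepy.openie_lib.get_openie_config()
list_IOB_corpus = ...   # assumed: chunked IOB corpus (see label_sents_from_chunked() below)

# build the labelled-sentence dict, then score it
dict_labelled = shallow_parse_lib.label_sents_from_chunked(
    list_IOB_corpus=list_IOB_corpus,
    chunked_index=3,
    gold_index=2,
    dict_openie_config=dict_config )

shallow_parse_lib.eval_chunked_sents(
    dict_labelled_sents=dict_labelled,
    dict_openie_config=dict_config )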
-
soton_corenlppy.re.shallow_parse_lib.eval_shallow_parse(file_to_score=None, dict_scores=None, gold_index=2, dict_openie_config=None)[source]¶
load a chunked file and score it
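A sketch under assumptions not confirmed by the signature alone: file_to_score is a chunked IOB file on disk and dict_scores is a dict that the call populates with the computed scores; the filename is illustrative:

# hedged sketch - load a chunked IOB file and score it against its gold column (index 2)
from soton_corenlppy.re import shallow_parse_lib

dict_config = ...  # assumed: config object from openiepy.openie_lib.get_openie_config()
dict_scores = {}   # assumed: populated with the computed scores by the call

shallow_parse_lib.eval_shallow_parse(
    file_to_score='corpus_test.iob.chunked',
    dict_scores=dict_scores,
    gold_index=2,
    dict_openie_config=dict_config )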
-
soton_corenlppy.re.shallow_parse_lib.label_sents_from_chunked(list_IOB_corpus=None, chunked_index=3, gold_index=2, dict_openie_config=None)[source]¶
process a chunked IOB corpus and return a JSON structure of labelled sentences (see the illustrative example at the end of this entry)
- BILOU:
B - 'beginning', I - 'inside', L - 'last', O - 'outside', U - 'unit' (singular occurrence)
- Parameters
- Returns
dict of sent labels = { sent_index : { ‘sent’ : ‘…’, ‘pos’ : ‘…’, ‘chunk-labels’ : […], ‘gold-labels’ : […] } }
- Return type
dict
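For illustration, one entry of the returned dict might look like the following (the sentence, POS tags and chunk types are invented; only the key names and the BILOU prefixes follow the description above):

# illustrative example of one entry in the returned dict of sent labels (values are made up)
dict_labelled_sents = {
    0 : {
        'sent' : 'New York is big',
        'pos' : 'NNP NNP VBZ JJ',
        'chunk-labels' : [ 'B-NP', 'L-NP', 'O', 'U-ADJP' ],
        'gold-labels' : [ 'B-NP', 'L-NP', 'O', 'U-ADJP' ],
    }
}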
-
soton_corenlppy.re.shallow_parse_lib.label_sents_from_chunked_file(chunk_file=None, chunked_index=3, gold_index=2, dict_openie_config=None)[source]¶
load an IOB chunked file and return a JSON structure of labelled sentences (see the usage sketch at the end of this entry)
- BILOU:
B - 'beginning', I - 'inside', L - 'last', O - 'outside', U - 'unit' (singular occurrence)
- Parameters
- Returns
dict of sent labels = { sent_index : { ‘sent’ : ‘…’, ‘pos’ : ‘…’, ‘chunk-labels’ : […], ‘gold-labels’ : […] } }
- Return type
dict
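A minimal sketch, assuming the chunked file on disk has the predicted labels in column 3 and the gold labels in column 2 (the defaults above); the filename is illustrative:

# hedged sketch - build the labelled-sentence dict directly from a chunked IOB file
from soton_corenlppy.re import shallow_parse_lib

dict_config = ...  # assumed: config object from openiepy.openie_lib.get_openie_config()

dict_labelled = shallow_parse_lib.label_sents_from_chunked_file(
    chunk_file='corpus_test.iob.chunked',
    chunked_index=3,
    gold_index=2,
    dict_openie_config=dict_config )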
-
soton_corenlppy.re.shallow_parse_lib.shallow_parse_crf(list_IOB_test_corpus=None, crf_model=None, word2features=<function word2features>, word2featuresConfig=None, log_eval=True, dict_openie_config=None)[source]¶
run a trained scikit-learn CRF model on a test IOB corpus. for scikit-learn CRF the IOB corpus is a list of sentences, each sentence being a list of token tuples (phrase, POS, IOB tag). if this is unlabelled test data the IOB tag should be 'O' (see the usage sketch at the end of this entry)
- Parameters
list_IOB_test_corpus (list) – test corpus = [ [ (token,pos,iob_tag), … ], … ]
crf_model (sklearn_crfsuite.CRF) – CRF model from train_shallow_parse_crf()
word2features (function) – function pointer to word2features(sent, i) which will be called to generate a feature dict for each word in an IOB sentence.
word2featuresConfig (dict) – dict of config for word2features(), can be None if not needed.
log_eval (bool) – if True eval data will be logged (assuming IOB test corpus has gold tags provided). If False then returned macro_F1 + macro_scores are None
dict_openie_config (dict) – config object returned from openiepy.openie_lib.get_openie_config()
- Returns
list of labelled sentences, each a list of predicted IOB labels (one per token in the test IOB corpus), e.g. [ ['B-LOC', 'I-LOC', 'O', … ], … ], plus macro_F1 and macro_scores
- Return type
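A usage sketch, assuming crf_model is the model returned by train_shallow_parse_crf() and that the return value unpacks as (labels, macro_F1, macro_scores) per the description above; the test sentence is a toy example:

# hedged sketch - tag a test IOB corpus with a trained scikit-learn CRF model
from soton_corenlppy.re import shallow_parse_lib

dict_config = ...  # assumed: config object from openiepy.openie_lib.get_openie_config()
crf_model = ...    # assumed: sklearn_crfsuite.CRF model from train_shallow_parse_crf()

# unlabelled test data, so every IOB tag is 'O'
list_test = [
    [ ('New', 'NNP', 'O'), ('York', 'NNP', 'O'), ('is', 'VBZ', 'O'), ('big', 'JJ', 'O') ],
]

list_predicted, macro_F1, macro_scores = shallow_parse_lib.shallow_parse_crf(
    list_IOB_test_corpus=list_test,
    crf_model=crf_model,
    log_eval=False,
    dict_openie_config=dict_config )
# list_predicted might be e.g. [ ['B-LOC', 'I-LOC', 'O', 'O'] ], depending on the trained model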
-
soton_corenlppy.re.shallow_parse_lib.shallow_parse_crf_plus_plus(file_IOB_test_corpus=None, list_IOB_test_corpus=None, test_filename=None, model_file=None, template=None, dict_openie_config=None)[source]¶
run a trained CRF++ chunker on a set of test JSON files (see the usage sketch at the end of this entry)
- Parameters
file_IOB_test_corpus (str) – test corpus serialized as IOB text file (if None will use list_IOB_test_corpus)
list_IOB_test_corpus (list) – test corpus = [ (token,pos,iob_tag), …, ('',), (token,pos,iob_tag), … ]. (if None will use file_IOB_test_corpus)
test_filename (str) – base filename for output files = test IOB (suffix .iob) and classified IOB (suffix .iob.chunked)
model_file (str) – model filename trained using train_shallow_parse_crf()
template (str) – filename of CRF template
dict_openie_config (dict) – config object returned from openiepy.openie_lib.get_openie_config()
- Returns
name of chunked file created
- Return type
str
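A sketch of the CRF++ variant, assuming the model and template files already exist on disk (filenames are illustrative) and using the ('',) sentence terminator of the in-memory corpus format:

# hedged sketch - run a trained CRF++ chunker over an in-memory IOB test corpus
from soton_corenlppy.re import shallow_parse_lib

dict_config = ...  # assumed: config object from openiepy.openie_lib.get_openie_config()

# flat token list, each sentence terminated by a ('',) tuple
list_test = [
    ('New', 'NNP', 'O'), ('York', 'NNP', 'O'), ('is', 'VBZ', 'O'), ('big', 'JJ', 'O'), ('',),
]

chunked_file = shallow_parse_lib.shallow_parse_crf_plus_plus(
    file_IOB_test_corpus=None,
    list_IOB_test_corpus=list_test,
    test_filename='corpus_test',
    model_file='corpus_train.model',
    template='crf_template.txt',
    dict_openie_config=dict_config )
# chunked_file is the name of the chunked output file (expected to end in .iob.chunked)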
-
soton_corenlppy.re.shallow_parse_lib.train_shallow_parse_crf(list_IOB_training_corpus=None, word2features=<function word2features>, word2featuresConfig=None, n_jobs=-1, log_eval=True, params_space={'c1': [0, 0.1, 1.0], 'c2': [0, 0.1, 1.0]}, num_folds=3, all_possible_transitions=True, dict_openie_config=None)[source]¶
train a scikit-learn CRF model using an IOB training corpus. for scikit-learn CRF the IOB corpus is a list of sentences, each sentence being a list of token tuples (phrase, POS, … other features, IOB tag). the trained CRF model is returned as an object to allow efficient in-memory running (unlike the CRF++ version of this function, which serializes the IOB training corpus and runs an EXE to classify). model params are optimised using sklearn.model_selection.GridSearchCV (see the usage sketch at the end of this entry)
- Parameters
list_IOB_training_corpus (list) – training corpus = [ [ (token,pos,…,iob_tag), … ], … ]
word2features (function) – function pointer to word2features(sent, i, dict_config) which will be called to generate a feature dict for each word in an IOB sentence.
word2featuresConfig (dict) – dict of config for word2features(), can be None if not needed.
n_jobs (int) – number of jobs to spawn for random param search (optimising CRF model training) (-1 uses all available processors)
log_eval (bool) – if True eval data will be logged. if False macro_F1 is None
all_possible_transitions (bool) – if True negative transitions will also be considered, generating L**2 transitions for L labels. this can improve accuracy but is very slow to compute for larger datasets.
params_space (dict) – param space for GridSearchCV (the default gives a basic search with 9 combinations of c1 and c2 params in total)
num_folds (int) – number of folds for GridSearchCV
dict_openie_config (dict) – config object returned from openiepy.openie_lib.get_openie_config()
- Returns
trained CRF model object, macro_F1 score
- Return type
sklearn_crfsuite.CRF, float
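A training sketch using the default word2features feature extractor and the default c1/c2 grid; the corpus shown is a toy example in the (token, pos, iob_tag) layout described above, and the return is assumed to unpack as (model, macro_F1) per the return type:

# hedged sketch - train a scikit-learn CRF shallow parser with a small GridSearchCV over c1/c2
from soton_corenlppy.re import shallow_parse_lib

dict_config = ...  # assumed: config object from openiepy.openie_lib.get_openie_config()

list_train = [
    [ ('New', 'NNP', 'B-LOC'), ('York', 'NNP', 'I-LOC'), ('is', 'VBZ', 'O'), ('big', 'JJ', 'O') ],
    # ... more labelled sentences ...
]

crf_model, macro_F1 = shallow_parse_lib.train_shallow_parse_crf(
    list_IOB_training_corpus=list_train,
    n_jobs=-1,
    log_eval=True,
    params_space={ 'c1': [0, 0.1, 1.0], 'c2': [0, 0.1, 1.0] },
    num_folds=3,
    all_possible_transitions=True,
    dict_openie_config=dict_config )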
-
soton_corenlppy.re.shallow_parse_lib.train_shallow_parse_crf_plus_plus(file_IOB_training_corpus=None, list_IOB_training_corpus=None, training_filename=None, template=None, dict_openie_config=None)[source]¶
create a training IOB file from a set of JSON documents and run a chunker (CRF++) to create a model file. for CRF++ the IOB training corpus is a token list, with a ('',) tuple terminating each sentence (see the usage sketch at the end of this entry)
- Parameters
file_IOB_training_corpus (str) – training corpus serialized as IOB text file (if None will use list_IOB_training_corpus)
list_IOB_training_corpus (list) – training corpus = [ (token,pos,iob_tag), …, ('',), (token,pos,iob_tag), … ]. (if None will use file_IOB_training_corpus)
training_filename (str) – base filename for output files = training IOB (suffix .iob) and training model files (suffix .model)
template (str) – filename of CRF template
dict_openie_config (dict) – config object returned from openiepy.openie_lib.get_openie_config()
- Returns
name of model file created
- Return type
str
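A sketch of the CRF++ training variant, assuming CRF++ is installed and a CRF++ feature template file already exists on disk; the filenames are illustrative:

# hedged sketch - serialize an in-memory IOB training corpus and train a CRF++ chunker model
from soton_corenlppy.re import shallow_parse_lib

dict_config = ...  # assumed: config object from openiepy.openie_lib.get_openie_config()

# flat token list, each sentence terminated by a ('',) tuple
list_train = [
    ('New', 'NNP', 'B-LOC'), ('York', 'NNP', 'I-LOC'), ('is', 'VBZ', 'O'), ('big', 'JJ', 'O'), ('',),
    # ... more sentences ...
]

model_file = shallow_parse_lib.train_shallow_parse_crf_plus_plus(
    file_IOB_training_corpus=None,
    list_IOB_training_corpus=list_train,
    training_filename='corpus_train',
    template='crf_template.txt',
    dict_openie_config=dict_config )
# model_file is the name of the model file created (expected to end in .model)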