soton_corenlppy.common_parse_lib module
common parse lib supporting tokenization, POS tagging and sentence management
- 
soton_corenlppy.common_parse_lib.check_retweet(original_text)
- check for retweets (assumes raw unprocessed text e.g. ‘RT @username …’)
- Parameters
- original_text (unicode) – UTF-8 text to check 
- Returns
- True if text contains a retweet pattern 
- Return type
- bool
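A check of this kind can be sketched with a regular expression. This is an illustration only: the pattern and the helper name looks_like_retweet are assumptions, and the pattern the library actually uses may differ.

```python
import re

# hypothetical pattern: 'RT' as a whole word followed by an @username mention
RETWEET_PATTERN = re.compile(r'\bRT\s+@\w+', re.IGNORECASE)


def looks_like_retweet(original_text):
    """return True if the raw text contains an 'RT @username' style pattern"""
    return RETWEET_PATTERN.search(original_text) is not None


if __name__ == '__main__':
    print(looks_like_retweet('RT @username hello world'))
    print(looks_like_retweet('just a normal post'))
```

The text should be checked before cleaning, since lowercasing or punctuation handling could alter the ‘RT @username’ marker.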
 
- 
soton_corenlppy.common_parse_lib.clean_text(original_text, dict_common_config, whitespace_chars=None)
- clean a block of unicode text ready for tokenization. sequences of whitespace are replaced with a single space. if config[lower_tokens] = True then text is made lowercase. if config[apostrophe_handling] = ‘preserve’ then apostrophe entries are preserved (even if the apostrophe is a whitespace character). if config[apostrophe_handling] = ‘strip’ then apostrophe entries are removed.
- Parameters
- original_text (unicode) – UTF-8 text to clean 
- dict_common_config (dict) – config object returned from common_parse_lib.get_common_config() 
- whitespace_chars (unicode) – whitespace characters. if None the whitespace setting in dict_common_config will be used 
 
- Returns
- clean text 
- Return type
- unicode 
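The whitespace and lowercase steps can be sketched as follows. This is a minimal sketch, not the library code: clean_text_sketch is a hypothetical name, a plain dict stands in for the config object, apostrophe handling is left out, and replacing the configured whitespace characters with a space (rather than deleting them) is an assumption.

```python
import re


def clean_text_sketch(original_text, config, whitespace_chars=None):
    """collapse whitespace runs to a single space and optionally lowercase"""
    # fall back to the config setting when no explicit character set is given
    if whitespace_chars is None:
        whitespace_chars = config.get('whitespace', '')
    # treat the configured characters as whitespace in addition to \s
    pattern = r'[\s' + re.escape(whitespace_chars) + r']+'
    text = re.sub(pattern, ' ', original_text).strip()
    if config.get('lower_tokens', False):
        text = text.lower()
    return text


if __name__ == '__main__':
    config = {'whitespace': '\u201c\u201d', 'lower_tokens': True}
    print(clean_text_sketch('Hello   \u201cWorld\u201d\n\tagain', config))
```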
 
- 
soton_corenlppy.common_parse_lib.create_ngram_tokens(list_tokens, max_gram=4, sent_temination_tokens=None)
- compile n-gram phrase sets, keeping the linear sequence of tokens intact, up to a maximum gram size. the optional sent_temination_tokens prevents n-gram tokens from spanning sent terminator tokens (e.g. newlines)
- Parameters
- Returns
- list of n-gram token sets, one per gram size, e.g. [ [(‘one’,),(‘two’,),(‘three’,),(‘four’,)], [(‘one’,’two’), (‘two’,’three’), (‘three’,’four’)], [(‘one’,’two’,’three’),(‘two’,’three’,’four’)] ] 
- Return type
- list
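The n-gram construction described for this function can be sketched in plain Python. The name ngram_tokens_sketch is hypothetical, and dropping the terminator tokens from the output is an assumption; the library may treat them differently.

```python
def ngram_tokens_sketch(list_tokens, max_gram=4, sent_termination_tokens=None):
    """build n-gram tuples for n = 1..max_gram, never spanning a terminator"""
    terminators = set(sent_termination_tokens or [])

    # split the token sequence into segments at terminator tokens
    segments, current = [], []
    for token in list_tokens:
        if token in terminators:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(token)
    if current:
        segments.append(current)

    # slide a window of size n over each segment, preserving token order
    result = []
    for n in range(1, max_gram + 1):
        grams = [tuple(seg[i:i + n])
                 for seg in segments
                 for i in range(len(seg) - n + 1)]
        if grams:
            result.append(grams)
    return result


if __name__ == '__main__':
    for gram_set in ngram_tokens_sketch(['one', 'two', 'three', 'four'], max_gram=3):
        print(gram_set)
```

With max_gram=3 and the four tokens above, the sketch yields the three gram sets shown in the Returns example.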
 
- 
soton_corenlppy.common_parse_lib.create_sent_trees(list_pos, list_sent_addr_offsets=None, dict_common_config=None)
- create a set of nltk.Tree structures for sentences. sent delimiter characters are taken from dict_common_config[‘sent_token_seps’] and the period character
- Parameters
- list_pos (list) – POS tagged sentence e.g. [ (‘And’, ‘CC’), (‘now’, ‘RB’), (‘for’, ‘IN’), (‘something’, ‘NN’), (‘completely’, ‘RB’), (‘different’, ‘JJ’) ] 
- list_sent_addr_offsets (list) – list which (if not None) will be populated with the start address of each sent (address within the original POS tagged sent) 
- dict_common_config (dict) – config object returned from common_parse_lib.get_common_config() 
 
- Returns
- list of nltk.Tree sentence structures e.g. [ nltk.Tree(S And/CC now/RB for/IN something/NN completely/RB different/JJ), … ] 
- Return type
- list
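The sentence splitting and the role of list_sent_addr_offsets can be sketched without NLTK. This is an illustration under stated assumptions: split_tagged_sents_sketch is a hypothetical name, plain lists stand in for nltk.Tree objects, and keeping the delimiter token at the end of its sentence is an assumption.

```python
def split_tagged_sents_sketch(list_pos, sent_token_seps, list_sent_addr_offsets=None):
    """split a POS tagged token list into sentences at delimiter tokens"""
    # delimiters are the configured separators plus the period character
    separators = set(sent_token_seps) | {'.'}
    sents, current, start = [], [], 0
    for index, (token, tag) in enumerate(list_pos):
        if not current:
            # remember where this sentence starts in the original list
            start = index
        current.append((token, tag))
        if token in separators:
            sents.append(current)
            if list_sent_addr_offsets is not None:
                list_sent_addr_offsets.append(start)
            current = []
    # flush a trailing sentence that has no closing delimiter
    if current:
        sents.append(current)
        if list_sent_addr_offsets is not None:
            list_sent_addr_offsets.append(start)
    return sents


if __name__ == '__main__':
    offsets = []
    tagged = [('And', 'CC'), ('now', 'RB'), ('.', '.'), ('Next', 'JJ')]
    print(split_tagged_sents_sketch(tagged, ['\n'], offsets))
    print(offsets)
```

The offsets list is filled in place, mirroring the "populated if not None" convention of the list_sent_addr_offsets parameter.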
 
- 
soton_corenlppy.common_parse_lib.escape_tagged_token(tuple_pos)
- escape open and close brackets in a POS token to make it nltk.Tree safe 
- 
soton_corenlppy.common_parse_lib.escape_token(token_str)
- escape open and close brackets in a token to make it nltk.Tree safe
- Parameters
- token_str (unicode) – token text to process 
- Returns
- escaped text 
- Return type
- unicode 
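Bracket escaping and its inverse can be sketched as a pair of string replacements. The -LRB- and -RRB- markers are taken from the serialize_tagged_tree() example on this page; the table name ESCAPE_TUPLES and both function names are hypothetical, and the library's own replacement list may contain other entries.

```python
# hypothetical replacement table, modelled on the -LRB- / -RRB- markers
ESCAPE_TUPLES = [('(', '-LRB-'), (')', '-RRB-')]


def escape_token_sketch(token_str):
    """replace brackets so the token cannot be read as tree structure"""
    for raw, safe in ESCAPE_TUPLES:
        token_str = token_str.replace(raw, safe)
    return token_str


def unescape_token_sketch(token_str):
    """restore the original brackets in an escaped token"""
    for raw, safe in ESCAPE_TUPLES:
        token_str = token_str.replace(safe, raw)
    return token_str


if __name__ == '__main__':
    escaped = escape_token_sketch('(man-made location)')
    print(escaped)
    print(unescape_token_sketch(escaped))
```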
 
- 
soton_corenlppy.common_parse_lib.flattern_sent(tree_sent, dict_common_config)
- flatten a sent so it has no subtrees. subtrees are flattened to make phrases. this is useful for subsequent processing that requires a tagged list, such as dependency parsing
- Parameters
- tree_sent (nltk.Tree) – sentence tree with any number of subtrees e.g. (S (CC And) (RB now) (VP (NP New) (NP York)) … ) 
- dict_common_config (dict) – config object returned from common_parse_lib.get_common_config() 
 
- Returns
- flattened sent tree with phrases for subtrees e.g. (S (CC And) (RB now) (VP New York) … ) 
- Return type
- nltk.Tree 
 
- 
soton_corenlppy.common_parse_lib.flattern_tree_with_heads(tree)
- flatten an nltk.Tree preserving the head node, so all tokens are returned (unlike nltk.Tree.leaves()).
- Parameters
- tree (nltk.Tree) – tree to flatten 
- Returns
- list of tokens under tree, including the head 
- Return type
- list
 
- 
soton_corenlppy.common_parse_lib.get_common_config(**kwargs)
- return a common config object for this specific set of languages. the config object contains an instantiated NLTK stemmer, tokenizer and settings tailored for the chosen language set. all available language specific corpora, such as stoplists, will be read into memory. common config settings are below:
- stemmer = NLTK stemmer. default is no stemming = nltk.stem.RegexpStemmer(‘’, 100000) 
- t_word = NLTK word tokenizer for chosen language. default is nltk.tokenize.treebank.TreebankWordTokenizer() 
- t_sent = NLTK sent tokenizer for chosen language. default is nltk.tokenize.punkt.PunktSentenceTokenizer() 
- regex_namespace = regre.RegexObject, regex to match namespaces e.g. www.bbc.co.uk 
- regex_url = regre.RegexObject, regex to match URIs e.g. http://www.bbc.co.uk 
- regex_numeric_extract = regre.RegexObject, regex to match numeric strings e.g. 56, 56.76, $54.23, $54 but NOT 52.com 
- lang_codes = list, list of ISO 639-1 2 character language codes e.g. [‘en’,’fr’] 
- stoplist = list, aggregated set of stopwords for languages selected 
- logger = logging.Logger, logger object 
- whitespace = str, string containing whitespace characters that will be removed prior to tokenization. default is “\u201a\u201b\u201c\u201d 
- punctuation = str, string containing punctuation characters that will be forced into their own token. default is ,;\/:+-#~&*=!? 
- corpus_dir = str, directory where common_parse_lib language specific corpus files are located 
- max_gram = int, maximum size of n-grams for use in create_ngram_tokens() function. default is 4 
- first_names = set, aggregated language specific set of first names 
- lower_tokens = bool, if True text will be converted to lower() before tokenization. default is False 
- sent_token_seps = list, unicode sentence termination tokens. default is [\n, \r\n, \f, \u2026] 
- stanford_tagger_dir = base dir for Stanford POS tagger (e.g. c:\stanford-postagger-full) 
- treetagger_tagger_dir = base dir for TreeTagger (e.g. c:\treetagger) 
- lang_pos_mapping = dict, set of language to POS tagger mappings. e.g. { ‘en’ : ‘stanford’, ‘ru’ : ‘treetagger’ } 
- pos_sep = tuple, POS separator character and a safe replacement. the default POS separator char is ‘/’ and usually POS tagged sentences become ‘term/POS term/POS …’. when tagging a token containing this character e.g. ‘one/two’ the POS separator character will be replaced prior to serialization to avoid an ambiguous output. 
- token_preservation_regex = dict mapping the config key name of a regre.RegexObject, used to identify tokens that should be preserved, to a unique POS token name (e.g. { ‘regex_namespace’ : ‘NAMESPACE’, ‘regex_url’ : ‘URI’ } ). POS token name must be unique for chosen POS tagger and safe for POS serialization, without characters like ‘ ‘ or ‘/’. this dict argument allows additional POS tokens to be added in the future without the need to change the common_parse_lib code. 
- note: a config object approach is used, as opposed to a global variable, to allow common_parse_lib functions to work in a multi-threaded environment
- Parameters
- kwargs – variable argument to override any default config values 
- Returns
- configuration settings to be used by all common_parse_lib functions 
- Return type
- dict
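The config object approach, with keyword arguments overriding defaults, can be sketched as below. This is not the library implementation: make_config_sketch is a hypothetical name and only a handful of the documented settings are included, with the default values listed for this function.

```python
def make_config_sketch(**kwargs):
    """return a fresh config dict, with defaults overridden by kwargs"""
    # a new dict (and new lists) per call, so callers never share state
    config = {
        'max_gram': 4,
        'lower_tokens': False,
        'sent_token_seps': ['\n', '\r\n', '\f', '\u2026'],
        'punctuation': ',;\\/:+-#~&*=!?',
    }
    config.update(kwargs)
    return config


if __name__ == '__main__':
    config_a = make_config_sketch(lower_tokens=True, max_gram=2)
    config_b = make_config_sketch()
    print(config_a['lower_tokens'], config_a['max_gram'])
    print(config_b['lower_tokens'], config_b['max_gram'])
```

Because each call builds an independent object that is passed explicitly to every function, two threads can use different settings without interfering with each other, which is the motivation given in the note on this function.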
 
- 
soton_corenlppy.common_parse_lib.is_all_stoplist(list_tokens, dict_common_config)
- check to see if tokens are only stoplist tokens 
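A check of this kind reduces to a membership test over the token list. The sketch below is an assumption about the behaviour, not the library code: is_all_stoplist_sketch is a hypothetical name and it takes the stoplist directly rather than a config object.

```python
def is_all_stoplist_sketch(list_tokens, stoplist):
    """return True if every token is a stoplist token"""
    stopwords = set(stoplist)
    return all(token in stopwords for token in list_tokens)


if __name__ == '__main__':
    stoplist = ['the', 'of', 'a']
    print(is_all_stoplist_sketch(['the', 'of'], stoplist))
    print(is_all_stoplist_sketch(['the', 'london'], stoplist))
```

Note that this sketch returns True for an empty token list; whether the library does the same is not documented here.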
- 
soton_corenlppy.common_parse_lib.ngram_tokenize_microblog_text(text, dict_common_config)
- tokenize a microblog entry (e.g. tweet) into all possible combinations of n-gram phrases, keeping the linear sentence structure intact. text will be cleaned and tokenized. URLs and namespaces are explicitly preserved as single tokens. a set of all possible n-gram tokens is returned, up to max_gram
- Parameters
- text (unicode) – UTF-8 text to tokenize 
- dict_common_config (dict) – config object returned from common_parse_lib.get_common_config() 
 
- Returns
- list of n-gram token sets e.g. [ [(‘one’,),(‘two’,),(‘three’,),(‘four’,)], [(‘one’,’two’), (‘two’,’three’), (‘three’,’four’)], [(‘one’,’two’,’three’),(‘two’,’three’,’four’)] ] 
- Return type
- list
 
- 
soton_corenlppy.common_parse_lib.parse_serialized_tagged_tree(serialized_tree, dict_common_config)
- parse a previously serialized tree
- note: tokens are unescaped using the replacement characters defined in list_escape_tuples
- Parameters
- serialized_tree (unicode) – serialized tree structure containing POS tagged leaves from common_parse_lib.serialize_tagged_tree() 
- dict_common_config (dict) – config object returned from common_parse_lib.get_common_config() 
 
- Returns
- tree representing POS tagged sentence e.g. (S And/CC now/RB for/IN something/NN completely/RB different/JJ) 
- Return type
- nltk.Tree 
 
- 
soton_corenlppy.common_parse_lib.pos_tag_tokenset(token_set, lang, dict_common_config, timeout=300)
- POS tag a batch of tokenized sentences for a specific language. it is more efficient to POS tag in large batches as the POS tagger is a separate process that must be invoked using an OS exec and a Python subprocess command. there is a fixed overhead for sub-process and pipe setup time (e.g. 1-2 seconds) so processing text in bulk is more efficient than processing many small separate sentences.
- note: the POS tagger used is chosen from TreeTagger, Stanford and Treebank based on language code
- note: URLs and namespaces matching regex patterns provided in dict_common_config will get a POS tag of ‘URI’ or ‘NAMESPACE’ regardless of which POS tagger is used
- note: tokens matching characters in dict_common_config[‘sent_token_seps’] will be labelled with a POS tag ‘NEWLINE’
- Parameters
- token_set (list) – list of tokens for a set of sentences. each sentence has a token set, which is itself a list of either tokenized phrase tuples or tokenized phrase strings. e.g. [ [ (‘london’,),(‘attacks’,),(‘are’,) … ], … ] or [ [ ‘london’,’attacks’,’are’, … ], … ] 
- lang (str) – ISO 639-1 2 character language code (e.g. ‘en’) 
- dict_common_config (dict) – config object returned from common_parse_lib.get_common_config() 
- timeout (int) – timeout in seconds for POS tagger process in the unlikely event the POS tagger hangs 
 
- Returns
- list of POS tagged sentences e.g. [ [ (‘And’, ‘CC’), (‘now’, ‘RB’), (‘for’, ‘IN’), (‘something’, ‘NN’), (‘completely’, ‘RB’), (‘different’, ‘JJ’) ], … ] 
- Return type
- list
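The fixed ‘URI’, ‘NAMESPACE’ and ‘NEWLINE’ tags described in the notes for this function can be sketched as a post-processing step over tagger output. Everything here is illustrative: the two regex patterns are simplified stand-ins for the ones held in the config object, and the function names are hypothetical.

```python
import re

# hypothetical, simplified patterns (the config object holds the real ones)
REGEX_URL = re.compile(r'^https?://\S+$')
REGEX_NAMESPACE = re.compile(r'^(?:[\w-]+\.)+[a-z]{2,}$')
SENT_TOKEN_SEPS = ['\n', '\r\n', '\f', '\u2026']


def special_pos_tag(token):
    """return 'NEWLINE', 'URI' or 'NAMESPACE' for preserved tokens, else None"""
    if token in SENT_TOKEN_SEPS:
        return 'NEWLINE'
    if REGEX_URL.match(token):
        return 'URI'
    if REGEX_NAMESPACE.match(token):
        return 'NAMESPACE'
    return None


def apply_special_tags(tagged_sent):
    """override tagger output for tokens that must keep a fixed POS tag"""
    return [(token, special_pos_tag(token) or tag) for token, tag in tagged_sent]


if __name__ == '__main__':
    tagged = [('see', 'VB'), ('http://www.bbc.co.uk', 'NN'),
              ('www.bbc.co.uk', 'NN'), ('\n', 'NN')]
    print(apply_special_tags(tagged))
```

Applying the override after tagging means the result is the same whichever external tagger produced the original tags.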
 
- 
soton_corenlppy.common_parse_lib.pos_tag_tokenset_batch(document_token_set, lang, dict_common_config, max_processes=4, timeout=300)
- POS tag a batch of tokenized sentences for a specific language. it is more efficient to POS tag in large batches as the POS tagger is a separate process that must be invoked using an OS exec and a Python subprocess command. there is a fixed overhead for sub-process and pipe setup time (e.g. 1-2 seconds) so processing text in bulk is more efficient than processing many small separate sentences. multiprocess spawning is used to maximize CPU usage, as tagging is slow and CPU intensive.
- note: the POS tagger used is chosen from TreeTagger, Stanford and Treebank based on language code
- note: URLs and namespaces matching regex patterns provided in dict_common_config will get a POS tag of ‘URI’ or ‘NAMESPACE’ regardless of which POS tagger is used
- note: tokens matching characters in dict_common_config[‘sent_token_seps’] will be labelled with a POS tag ‘NEWLINE’
- Parameters
- document_token_set (dict) – { docID : [ token_set for each document sent ] } 
- lang (str) – ISO 639-1 2 character language code (e.g. ‘en’) 
- dict_common_config (dict) – config object returned from common_parse_lib.get_common_config() 
- max_processes (int) – number of worker processes to spawn using multiprocessing.Process 
- timeout (int) – timeout in seconds for POS tagger process in the unlikely event the POS tagger hangs 
 
- Returns
- dict of POS tagged documents { docID : [ tagged_token_set for each document sent ] } 
- Return type
- dict
 
- 
soton_corenlppy.common_parse_lib.pos_tag_tokenset_batch_worker(tuple_queue=None, lang='en', dict_common_config=None, pause_on_start=0, timeout=300, process_id=0)
- worker process for common_parse_lib.pos_tag_tokenset_batch()
- Parameters
- tuple_queue (tuple) – tuple of queues (queueIn, queueOut, queueError). queueIn has tuples of ( doc_id, [ token_set for each document sent ] ). queueOut has tuples of ( doc_id, [ tagged_token_set for each document sent ] ). 
- lang (str) – ISO 639-1 2 character language code (e.g. ‘en’) 
- dict_common_config (dict) – config object returned from common_parse_lib.get_common_config() 
- pause_on_start (int) – number of seconds to delay worker startup before CPU intensive work begins (to allow other workers to start up as well) 
- timeout (int) – timeout in seconds for POS tagger process in the unlikely event the POS tagger hangs 
- process_id (int) – ID of process for logging purposes 
 
 
- 
soton_corenlppy.common_parse_lib.read_pipe_stderr(pipe_handle, queue_buffer)
- internal POS tagger process pipe callback function
- Parameters
- pipe_handle (file) – pipe handle to POS tagger errors 
- queue_buffer (Queue.Queue()) – queue where pipe errors can be stored 
 
 
- 
soton_corenlppy.common_parse_lib.read_pipe_stdout(pipe_handle, queue_buffer, lines_expected=1)
- internal POS tagger process pipe callback function
- Parameters
- pipe_handle (file) – pipe handle to POS tagger output 
- queue_buffer (Queue.Queue()) – queue where pipe output can be stored 
- lines_expected (int) – number of lines expected so we do not read other sentences from pipe 
 
 
- 
soton_corenlppy.common_parse_lib.serialize_tagged_list(list_pos, dict_common_config, serialization_style='pos')
- serialize POS tagged tokens (list)
- note: the POS separator (e.g. ‘/’) is replaced in all tokens and POS tags so that it can always be used safely as the separator in the serialization
- Parameters
- list_pos (list) – POS tagged sentence e.g. [ (‘And’, ‘CC’), (‘now’, ‘RB’), (‘for’, ‘IN’), (‘something’, ‘NN’), (‘completely’, ‘RB’), (‘different’, ‘JJ’) ] 
- dict_common_config (dict) – config object returned from common_parse_lib.get_common_config() 
- serialization_style (str) – either POS tag list style (pos) or sentence tree style (tree). pos style is ‘and/CC now/RB …’. tree style is ‘(CC and) (RB now) …’ 
 
- Returns
- serialized POS tagged sentence in style requested e.g. ‘new/NN york/NN’ 
- Return type
- unicode 
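The two serialization styles and the separator replacement can be sketched as follows. This is a sketch under stated assumptions: serialize_tagged_list_sketch is a hypothetical name, pos_sep is passed directly instead of via the config object, and the safe replacement character ‘|’ is invented for illustration since the library default is not given on this page.

```python
def serialize_tagged_list_sketch(list_pos, pos_sep=('/', '|'), serialization_style='pos'):
    """serialize (token, tag) pairs in 'pos' or 'tree' style"""
    sep_char, safe_char = pos_sep
    parts = []
    for token, tag in list_pos:
        # replace the separator inside tokens and tags to avoid ambiguity
        token = token.replace(sep_char, safe_char)
        tag = tag.replace(sep_char, safe_char)
        if serialization_style == 'pos':
            parts.append(token + sep_char + tag)
        elif serialization_style == 'tree':
            parts.append('(' + tag + ' ' + token + ')')
        else:
            raise ValueError('unknown serialization style: ' + serialization_style)
    return ' '.join(parts)


if __name__ == '__main__':
    tagged = [('and', 'CC'), ('now', 'RB'), ('one/two', 'CD')]
    print(serialize_tagged_list_sketch(tagged))
    print(serialize_tagged_list_sketch(tagged, serialization_style='tree'))
```

Without the replacement step, a token such as ‘one/two’ would serialize to ‘one/two/CD’, which cannot be split back into token and tag unambiguously.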
 
- 
soton_corenlppy.common_parse_lib.serialize_tagged_tree(tree_sent, dict_common_config)
- serialize POS tagged tokens (tree). this function recurses if the tree has one or more subtrees.
- note: all tokens are escaped using escape_token()
- Parameters
- tree_sent (nltk.Tree) – POS tagged sentence e.g. (S And/CC now/RB for/IN something/NN completely/RB different/JJ) or (S (CC And) (RB now) … ) 
- dict_common_config (dict) – config object returned from common_parse_lib.get_common_config() 
 
- Returns
- serialized POS tagged sentence e.g. ‘(S new/NN york/NN -LRB-man-made location-RRB-/PARENTHETICAL_MATERIAL)’ or ‘(S (NP New) (NP York) (PARENTHETICAL_MATERIAL -LRB-man-made location-RRB-))’ 
- Return type
- unicode 
 
- 
soton_corenlppy.common_parse_lib.tokenize_sentence(sent, dict_common_config)
- tokenizes a single sentence into stemmed tokens. if nltk.tokenize.treebank.TreebankWordTokenizer is used then tokens will be corrected for embedded punctuation within tokens and embedded periods within tokens unless they are numeric values 
- 
soton_corenlppy.common_parse_lib.unescape_tagged_token(tuple_pos)
- unescape open and close brackets in a POS token 
- 
soton_corenlppy.common_parse_lib.unescape_token(token_str)
- unescape open and close brackets in a token
- Parameters
- token_str (unicode) – token text to process 
- Returns
- unescaped text 
- Return type
- unicode 
 
- 
soton_corenlppy.common_parse_lib.unescape_tree(tree, depth=0)
- unescape open and close brackets in an nltk.Tree
- Parameters
- tree (nltk.Tree) – tree to process 
- depth (int) – recursion depth (internal variable) 
 
- Returns
- unescaped tree 
- Return type
- nltk.Tree 
 
- 
soton_corenlppy.common_parse_lib.unigram_tokenize_text(text=None, include_char_offset=False, dict_common_config=None)
- tokenize a text entry (e.g. tweet) into unigram tokens. text will be cleaned and tokenized. URLs and namespaces are explicitly preserved as single tokens.
- Parameters
- text (unicode) – UTF-8 text to tokenize 
- include_char_offset (bool) – if True result is a list of [ list_tokens, list_char_offset_for_tokens ] 
- dict_common_config (dict) – config object returned from common_parse_lib.get_common_config() 
 
- Returns
- list of unigram tokens e.g. [ ‘one’,’two’,’three’ ] 
- Return type
- list
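Unigram tokenization that keeps URLs whole and optionally reports character offsets can be sketched with a single regular expression. This is an illustration only: the token pattern and the name unigram_tokenize_sketch are assumptions, and the library relies on its configured NLTK tokenizer and regex patterns rather than this pattern.

```python
import re

# hypothetical, simplified token pattern: URLs first so they stay whole,
# then runs of word characters, then any single non-space character
TOKEN_PATTERN = re.compile(r'https?://\S+|\w+|\S')


def unigram_tokenize_sketch(text, include_char_offset=False):
    """return tokens, or [tokens, offsets] when include_char_offset is True"""
    matches = list(TOKEN_PATTERN.finditer(text))
    tokens = [m.group(0) for m in matches]
    if include_char_offset:
        # offset of each token's first character in the input text
        offsets = [m.start() for m in matches]
        return [tokens, offsets]
    return tokens


if __name__ == '__main__':
    print(unigram_tokenize_sketch('see http://www.bbc.co.uk now!'))
    print(unigram_tokenize_sketch('see http://www.bbc.co.uk now!', include_char_offset=True))
```

In this sketch the offsets refer to the text passed in; if the text is cleaned first, they refer to the cleaned text rather than the raw input.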
 
- 
soton_corenlppy.common_parse_lib.unigram_tokenize_text_with_sent_breakdown(text=None, include_char_offset=False, dict_common_config=None)
- tokenize a microblog entry (e.g. tweet) into unigram tokens broken down into individual sents. the sent token is removed from the end of each sent. text will be cleaned and tokenized. URLs and namespaces are explicitly preserved as single tokens.
- Parameters
- text (unicode) – UTF-8 text to tokenize 
- include_char_offset (bool) – if True result is a list of [ list_token_set, list_char_offset_for_token_set ] 
- dict_common_config (dict) – config object returned from common_parse_lib.get_common_config() 
 
- Returns
- list of sents, each itself a list of unigram tokens e.g. [ [ ‘one’,’two’,’three’ ], … ] 
- Return type
- list