soton_corenlppy.common_parse_lib module¶
common parse lib supporting tokenization, POS tagging and sentence management
-
soton_corenlppy.common_parse_lib.
check_retweet
(original_text)[source]¶ check for retweets (assumes raw unprocessed text e.g. ‘RT @username …’)
- Parameters
original_text (unicode) – UTF-8 text to check
- Returns
True if the text contains a retweet pattern
- Return type
bool
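Example: a minimal usage sketch (the tweet text and the True/False outputs shown are illustrative):
>>> from soton_corenlppy import common_parse_lib
>>> common_parse_lib.check_retweet( u'RT @bbcnews latest headlines' )
True
>>> common_parse_lib.check_retweet( u'an ordinary tweet' )
False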
-
soton_corenlppy.common_parse_lib.
clean_text
(original_text, dict_common_config, whitespace_chars=None)[source]¶ clean a block of unicode text ready for tokenization. sequences of whitespace are replaced with a single space. if config[‘lower_tokens’] == True the text is made lowercase. if config[‘apostrophe_handling’] == ‘preserve’ then appos entries are preserved (even if an appos entry is a whitespace character). if config[‘apostrophe_handling’] == ‘strip’ then appos entries are removed
- Parameters
original_text (unicode) – UTF-8 text to clean
dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()
whitespace_chars (unicode) – whitespace characters to remove. if None the ‘whitespace’ setting in dict_common_config will be used
- Returns
clean text
- Return type
unicode
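Example: a minimal usage sketch (assumes the default English config and its corpus files are available; the output shown illustrates whitespace collapsing and is indicative only):
>>> from soton_corenlppy import common_parse_lib
>>> config = common_parse_lib.get_common_config( lang_codes=['en'] )
>>> common_parse_lib.clean_text( u'some   text  with   odd   spacing', config )
u'some text with odd spacing'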
-
soton_corenlppy.common_parse_lib.
create_ngram_tokens
(list_tokens, max_gram=4, sent_temination_tokens=None)[source]¶ compile n-gram phrase sets, keeping the linear sequence of tokens intact, up to a maximum gram size. the optional sent_temination_tokens prevents n-gram tokens from spanning sentence terminator tokens (e.g. newlines)
- Parameters
list_tokens (list) – list of unigram tokens e.g. [ ‘one’, ‘two’, ‘three’, ‘four’ ]
max_gram (int) – maximum size of n-gram tokens to compile
sent_temination_tokens (list) – list of sentence terminator tokens (e.g. newlines) that n-gram tokens must not span
- Returns
set of n-gram tokens e.g. [ [(‘one’,),(‘two’,),(‘three’,),(‘four’,)], [(‘one’,’two’), (‘two’,’three’), (‘three’,’four’)], [(‘one’,’two’,’three’),(‘two’,’three’,’four’)] ]
- Return type
list
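Example: a sketch reproducing the documented output above with max_gram=3 (input tokens as in the Returns example; repr shown without unicode prefixes for brevity):
>>> from soton_corenlppy import common_parse_lib
>>> common_parse_lib.create_ngram_tokens( [ 'one', 'two', 'three', 'four' ], max_gram=3 )
[[('one',), ('two',), ('three',), ('four',)], [('one', 'two'), ('two', 'three'), ('three', 'four')], [('one', 'two', 'three'), ('two', 'three', 'four')]]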
-
soton_corenlppy.common_parse_lib.
create_sent_trees
(list_pos, list_sent_addr_offsets=None, dict_common_config=None)[source]¶ create a set of nltk.Tree structures for sentences. sent delimiter characters are taken from dict_common_config[‘sent_token_seps’] and the period character
- Parameters
list_pos (list) – POS tagged sentence e.g. [ (‘And’, ‘CC’), (‘now’, ‘RB’), (‘for’, ‘IN’), (‘something’, ‘NN’), (‘completely’, ‘RB’), (‘different’, ‘JJ’) ]
list_sent_addr_offsets (list) – list which (if not None) will be populated with the start address of each sent (i.e. its offset within the original POS tagged list)
dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()
- Returns
list of nltk.Tree sentence structures e.g. [ nltk.Tree(S And/CC now/RB for/IN something/NN completely/RB different/JJ), … ]
- Return type
list
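Example: a minimal sketch using the documented POS tagged sentence (assumes the default English config; return structure follows the Returns description above):
>>> from soton_corenlppy import common_parse_lib
>>> config = common_parse_lib.get_common_config( lang_codes=['en'] )
>>> list_pos = [ ('And','CC'), ('now','RB'), ('for','IN'), ('something','NN'), ('completely','RB'), ('different','JJ') ]
>>> list_offsets = []
>>> trees = common_parse_lib.create_sent_trees( list_pos, list_offsets, config )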
-
soton_corenlppy.common_parse_lib.
escape_tagged_token
(tuple_pos)[source]¶ escape open and close brackets in a POS token to make it nltk.Tree safe
-
soton_corenlppy.common_parse_lib.
escape_token
(token_str)[source]¶ escape open and close brackets in a token to make it nltk.Tree safe
- Parameters
token_str (unicode) – token text to process
- Returns
escaped text
- Return type
unicode
-
soton_corenlppy.common_parse_lib.
flattern_sent
(tree_sent, dict_common_config)[source]¶ flatten a sent so it has no subtrees. subtrees are flattened to make phrases. this is useful for subsequent processing that requires a tagged list, such as dependency parsing
- Parameters
tree_sent (nltk.Tree) – sentence tree with any number of subtrees (S (CC And) (RB now) (VP (NP New) (NP York)) … )
dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()
- Returns
flattened sent tree with phrases for subtrees e.g. (S (CC And) (RB now) (VP New York) … )
- Return type
nltk.Tree
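Example: a minimal sketch flattening the documented tree (assumes the default English config; the printed result follows the Returns example above):
>>> import nltk
>>> from soton_corenlppy import common_parse_lib
>>> config = common_parse_lib.get_common_config( lang_codes=['en'] )
>>> tree = nltk.Tree.fromstring( u'(S (CC And) (RB now) (VP (NP New) (NP York)))' )
>>> print( common_parse_lib.flattern_sent( tree, config ) )
(S (CC And) (RB now) (VP New York))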
-
soton_corenlppy.common_parse_lib.
flattern_tree_with_heads
(tree)[source]¶ flatten an nltk.Tree preserving the head node, so all tokens are returned including the head (unlike nltk.Tree.leaves())
- Parameters
tree (nltk.Tree) – tree to flatten
- Returns
list of tokens under tree, including the head
- Return type
list
-
soton_corenlppy.common_parse_lib.
get_common_config
(**kwargs)[source]¶ return a common config object for a specific set of languages. the config object contains an instantiated NLTK stemmer, tokenizer and settings tailored for the chosen language set. all available language specific corpora, such as stoplists, will be read into memory. common config settings are listed below:
stemmer = NLTK stemmer, default is no stemming = nltk.stem.RegexpStemmer(‘’, 100000)
t_word = NLTK word tokenizer for chosen language. default is nltk.tokenize.treebank.TreebankWordTokenizer()
t_sent = NLTK sent tokenizer for chosen language. default is nltk.tokenize.punkt.PunktSentenceTokenizer()
regex_namespace = re.RegexObject, regex to match namespaces e.g. www.bbc.co.uk
regex_url = re.RegexObject, regex to match URIs e.g. http://www.bbc.co.uk
regex_numeric_extract = re.RegexObject, regex to match numeric strings e.g. 56, 56.76, $54.23, $54 but NOT 52.com
lang_codes = list, list of ISO 639-1 2 character language codes e.g. [‘en’,’fr’]
stoplist = list, aggregated set of stopwords for languages selected
logger = logging.Logger, logger object
whitespace = str, string containing whitespace characters that will be removed prior to tokenization. default is “\u201a\u201b\u201c\u201d”
punctuation = str, string containing punctuation characters that will be forced into their own token. default is ,;\/:+-#~&*=!?
corpus_dir = str, directory where common_parse_lib language specific corpus files are located
max_gram = int, maximum size of n-grams for use in create_ngram_tokens() function. default is 4
first_names = set, aggregated language specific set of first names
lower_tokens = bool, if True text will be converted to lower() before tokenization. default is False
sent_token_seps = list, unicode sentence termination tokens. default is [\n, \r\n, \f, \u2026]
stanford_tagger_dir = base dir for Stanford POS tagger (e.g. c:\stanford-postagger-full)
treetagger_tagger_dir = base dir for TreeTagger (e.g. c:\treetagger)
lang_pos_mapping = dict, set of language to POS tagger mappings. e.g. { ‘en’ : ‘stanford’, ‘ru’ : ‘treetagger’ }
pos_sep = tuple, POS separator character and a safe replacement. the default POS separator char is ‘/’ and usually POS tagged sentences become ‘term/POS term/POS …’. when tagging a token containing this character e.g. ‘one/two’ the POS separator character will be replaced prior to serialization to avoid an ambiguous output.
token_preservation_regex = dict, mapping key names of re.RegexObject objects (which identify tokens to be preserved) to a unique POS token name (e.g. { ‘regex_namespace’ : ‘NAMESPACE’, ‘regex_url’ : ‘URI’ }). the POS token name must be unique for the chosen POS tagger and safe for POS serialization, without characters like ‘ ‘ or ‘/’. this dict argument allows additional POS tokens to be added in the future without changing the common_parse_lib code.
note: a config object approach is used, as opposed to a global variable, to allow common_parse_lib functions to work in a multi-threaded environment
- Parameters
kwargs – variable argument to override any default config values
- Returns
configuration settings to be used by all common_parse_lib functions
- Return type
dict
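Example: a sketch overriding a few defaults via kwargs (the kwargs names mirror the settings listed above; the path and language choices are illustrative):
>>> from soton_corenlppy import common_parse_lib
>>> config = common_parse_lib.get_common_config(
...     lang_codes = ['en','fr'],
...     lower_tokens = True,
...     max_gram = 3,
...     stanford_tagger_dir = 'c:\\stanford-postagger-full' )
>>> config['max_gram']
3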
-
soton_corenlppy.common_parse_lib.
is_all_stoplist
(list_tokens, dict_common_config)[source]¶ check to see if tokens are only stoplist tokens
-
soton_corenlppy.common_parse_lib.
ngram_tokenize_microblog_text
(text, dict_common_config)[source]¶ tokenize a microblog entry (e.g. tweet) into all possible combinations of n-gram phrases, keeping the linear sentence structure intact. text will be cleaned and tokenized. URLs and namespaces are explicitly preserved as single tokens. a set of all possible n-gram tokens is returned, up to max_gram
- Parameters
text (unicode) – UTF-8 text to tokenize
dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()
- Returns
list of n-gram token sets e.g. [ [(‘one’,),(‘two’,),(‘three’,),(‘four’,)], [(‘one’,’two’), (‘two’,’three’), (‘three’,’four’)], [(‘one’,’two’,’three’),(‘two’,’three’,’four’)] ]
- Return type
list
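Example: a minimal sketch (assumes the default English config; the tweet text is illustrative, and the URL is preserved as a single token rather than split):
>>> from soton_corenlppy import common_parse_lib
>>> config = common_parse_lib.get_common_config( lang_codes=['en'] )
>>> ngram_sets = common_parse_lib.ngram_tokenize_microblog_text( u'london attacks reported http://www.bbc.co.uk', config )
>>> unigrams = ngram_sets[0]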
-
soton_corenlppy.common_parse_lib.
parse_serialized_tagged_tree
(serialized_tree, dict_common_config)[source]¶ parse a previously serialized tree. note: tokens are unescaped using the replacement characters defined in list_escape_tuples
- Parameters
serialized_tree (unicode) – serialized tree structure containing POS tagged leaves from common_parse_lib.serialize_tagged_tree()
dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()
- Returns
tree representing POS tagged sentence e.g. (S And/CC now/RB for/IN something/NN completely/RB different/JJ)
- Return type
nltk.Tree
-
soton_corenlppy.common_parse_lib.
pos_tag_tokenset
(token_set, lang, dict_common_config, timeout=300)[source]¶ POS tag a batch of tokenized sentences for a specific language. it is more efficient to POS tag in large batches because the POS tagger is a separate process that must be invoked using an OS exec and a Python subprocess command. there is a fixed overhead for sub-process and pipe setup (e.g. 1-2 seconds), so processing text in bulk is more efficient than many small separate sentences.
note: the POS tagger used is chosen from TreeTagger, Stanford and Treebank based on language code
note: URLs and namespaces matching regex patterns provided in dict_common_config will get a POS tag of ‘URI’ or ‘NAMESPACE’ regardless of which POS tagger is used
note: tokens matching characters in dict_common_config[‘sent_token_seps’] will be labelled with a POS tag ‘NEWLINE’
- Parameters
token_set (list) – list of tokens for a set of sentences. each sentence has a token set, which is itself a list of either tokenized phrase tuples or tokenized phrase strings. e.g. [ [ (‘london’,),(‘attacks’,),(‘are’,) … ], … ] e.g. [ [ ‘london’,’attacks’,’are’, … ], … ]
lang (str) – ISO 639-1 2 character language code (e.g. ‘en’)
dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()
timeout (int) – timeout in seconds for POS tagger process in the unlikely event the POS tagger hangs
- Returns
list of POS tagged sentences e.g. [ [ (‘And’, ‘CC’), (‘now’, ‘RB’), (‘for’, ‘IN’), (‘something’, ‘NN’), (‘completely’, ‘RB’), (‘different’, ‘JJ’) ], … ]
- Return type
list
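Example: a minimal sketch tagging one tokenized sentence (assumes an English POS tagger is configured, e.g. via the lang_pos_mapping and tagger directory settings in the config):
>>> from soton_corenlppy import common_parse_lib
>>> config = common_parse_lib.get_common_config( lang_codes=['en'] )
>>> token_set = [ [ u'london', u'attacks', u'are', u'reported' ] ]
>>> tagged = common_parse_lib.pos_tag_tokenset( token_set, 'en', config )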
-
soton_corenlppy.common_parse_lib.
pos_tag_tokenset_batch
(document_token_set, lang, dict_common_config, max_processes=4, timeout=300)[source]¶ POS tag a batch of tokenized sentences for a specific language. it is more efficient to POS tag in large batches because the POS tagger is a separate process that must be invoked using an OS exec and a Python subprocess command. there is a fixed overhead for sub-process and pipe setup (e.g. 1-2 seconds), so processing text in bulk is more efficient than many small separate sentences. multiprocess spawning is used to maximize CPU usage, as POS tagging is slow and CPU intensive.
note: the POS tagger used is chosen from TreeTagger, Stanford and Treebank based on language code
note: URLs and namespaces matching regex patterns provided in dict_common_config will get a POS tag of ‘URI’ or ‘NAMESPACE’ regardless of which POS tagger is used
note: tokens matching characters in dict_common_config[‘sent_token_seps’] will be labelled with a POS tag ‘NEWLINE’
- Parameters
document_token_set (dict) – { docID : [ token_set for each document sent ] }
lang (str) – ISO 639-1 2 character language code (e.g. ‘en’)
dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()
max_processes (int) – number of worker processes to spawn using multiprocessing.Process
timeout (int) – timeout in seconds for POS tagger process in the unlikely event the POS tagger hangs
- Returns
dict of POS tagged documents { docID : [ tagged_token_set for each document sent ] }
- Return type
dict
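Example: a minimal sketch tagging two documents in parallel (docIDs and text are illustrative; tagger setup assumptions as for pos_tag_tokenset()):
>>> from soton_corenlppy import common_parse_lib
>>> config = common_parse_lib.get_common_config( lang_codes=['en'] )
>>> documents = { 'doc1' : [ [ u'london', u'attacks' ] ], 'doc2' : [ [ u'hello', u'world' ] ] }
>>> tagged_docs = common_parse_lib.pos_tag_tokenset_batch( documents, 'en', config, max_processes=2 )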
-
soton_corenlppy.common_parse_lib.
pos_tag_tokenset_batch_worker
(tuple_queue=None, lang='en', dict_common_config=None, pause_on_start=0, timeout=300, process_id=0)[source]¶ worker process for common_parse_lib.pos_tag_tokenset_batch()
- Parameters
tuple_queue (tuple) – tuple of queue (queueIn, queueOut, queueError). queueIn has tuples of ( doc_id, [ token_set for each document sent ] ). queueOut has tuples of ( doc_id, [ tagged_token_set for each document sent ] ).
lang (str) – ISO 639-1 2 character language code (e.g. ‘en’)
dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()
pause_on_start (int) – number of seconds to delay thread startup before CPU intensive work begins (to allow other workers to startup also)
timeout (int) – timeout in seconds for POS tagger process in the unlikely event the POS tagger hangs
process_id (int) – ID of process for logging purposes
-
soton_corenlppy.common_parse_lib.
read_pipe_stderr
(pipe_handle, queue_buffer)[source]¶ internal POS tagger process pipe callback function
- Parameters
pipe_handle (file) – pipe handle to POS tagger errors
queue_buffer (Queue.Queue()) – queue where pipe errors can be stored
-
soton_corenlppy.common_parse_lib.
read_pipe_stdout
(pipe_handle, queue_buffer, lines_expected=1)[source]¶ internal POS tagger process pipe callback function
- Parameters
pipe_handle (file) – pipe handle to POS tagger output
queue_buffer (Queue.Queue()) – queue where pipe output can be stored
lines_expected (int) – number of lines expected so we do not read other sentences from pipe
-
soton_corenlppy.common_parse_lib.
serialize_tagged_list
(list_pos, dict_common_config, serialization_style='pos')[source]¶ serialize POS tagged tokens (list). note: the POS separator (e.g. ‘/’) is replaced in all tokens and POS tags so it remains unambiguous as a separator in the serialization
- Parameters
list_pos (list) – POS tagged sentence e.g. [ (‘And’, ‘CC’), (‘now’, ‘RB’), (‘for’, ‘IN’), (‘something’, ‘NN’), (‘completely’, ‘RB’), (‘different’, ‘JJ’) ]
dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()
serialization_style (str) – either POS tag list style (pos) or sentence tree style (tree). pos style is ‘and/CC now/RB …’. tree style is ‘(CC and) (RB now) …’
- Returns
serialized POS tagged sentence in style requested e.g. ‘new/NN york/NN’
- Return type
unicode
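Example: a minimal sketch of both serialization styles (assumes the default English config; outputs shown are indicative, following the style descriptions and Returns example above):
>>> from soton_corenlppy import common_parse_lib
>>> config = common_parse_lib.get_common_config( lang_codes=['en'] )
>>> list_pos = [ ('new','NN'), ('york','NN') ]
>>> common_parse_lib.serialize_tagged_list( list_pos, config, serialization_style='pos' )
u'new/NN york/NN'
>>> common_parse_lib.serialize_tagged_list( list_pos, config, serialization_style='tree' )
u'(NN new) (NN york)'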
-
soton_corenlppy.common_parse_lib.
serialize_tagged_tree
(tree_sent, dict_common_config)[source]¶ serialize POS tagged tokens (tree). this function will recurse if the tree has one or more subtrees. note: all tokens are escaped using escape_token()
- Parameters
tree_sent (nltk.Tree) – POS tagged sentence e.g. (S And/CC now/RB for/IN something/NN completely/RB different/JJ) or (S (CC And) (RB now) … )
dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()
- Returns
serialized POS tagged sentence e.g. ‘(S new/NN york/NN -LRB-man-made location-RRB-/PARENTHETICAL_MATERIAL)’ or ‘(S (NP New) (NP York) (PARENTHETICAL_MATERIAL -LRB-man-made location-RRB-))’
- Return type
unicode
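Example: a sketch of a serialize/parse round trip with parse_serialized_tagged_tree() (the sentence tree is built with create_sent_trees() so the leaf format matches what the serializer expects; assumes the default English config):
>>> from soton_corenlppy import common_parse_lib
>>> config = common_parse_lib.get_common_config( lang_codes=['en'] )
>>> list_pos = [ ('And','CC'), ('now','RB'), ('for','IN'), ('something','NN'), ('completely','RB'), ('different','JJ') ]
>>> tree_sent = common_parse_lib.create_sent_trees( list_pos, dict_common_config=config )[0]
>>> serialized = common_parse_lib.serialize_tagged_tree( tree_sent, config )
>>> tree_again = common_parse_lib.parse_serialized_tagged_tree( serialized, config )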
-
soton_corenlppy.common_parse_lib.
tokenize_sentence
(sent, dict_common_config)[source]¶ tokenize a single sentence into stemmed tokens. if nltk.tokenize.treebank.TreebankWordTokenizer is used, tokens will be corrected for embedded punctuation and embedded periods within tokens, unless they are numeric values
-
soton_corenlppy.common_parse_lib.
unescape_tagged_token
(tuple_pos)[source]¶ unescape open and close brackets in a POS token
-
soton_corenlppy.common_parse_lib.
unescape_token
(token_str)[source]¶ unescape open and close brackets in a token
- Parameters
token_str (unicode) – token text to process
- Returns
unescaped text
- Return type
unicode
-
soton_corenlppy.common_parse_lib.
unescape_tree
(tree, depth=0)[source]¶ unescape a nltk.Tree open and close brackets
- Parameters
tree (nltk.Tree) – tree to process
depth (int) – recursion depth (internal variable)
- Returns
unescaped tree
- Return type
nltk.Tree
-
soton_corenlppy.common_parse_lib.
unigram_tokenize_text
(text=None, include_char_offset=False, dict_common_config=None)[source]¶ tokenize a text entry (e.g. tweet) into unigram tokens. text will be cleaned and tokenized. URLs and namespaces are explicitly preserved as single tokens.
- Parameters
text (unicode) – UTF-8 text to tokenize
include_char_offset (bool) – if True the result is a list of [ list_tokens, list_char_offset_for_tokens ]
dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()
- Returns
list of unigram tokens e.g. [ ‘one’,’two’,’three’ ]
- Return type
list
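Example: a minimal sketch, with and without character offsets (assumes the default English config; input text is illustrative):
>>> from soton_corenlppy import common_parse_lib
>>> config = common_parse_lib.get_common_config( lang_codes=['en'] )
>>> tokens = common_parse_lib.unigram_tokenize_text( text=u'one two three', dict_common_config=config )
>>> tokens_and_offsets = common_parse_lib.unigram_tokenize_text( text=u'one two three', include_char_offset=True, dict_common_config=config )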
-
soton_corenlppy.common_parse_lib.
unigram_tokenize_text_with_sent_breakdown
(text=None, include_char_offset=False, dict_common_config=None)[source]¶ tokenize a microblog entry (e.g. tweet) into unigram tokens broken down into individual sents. the sent terminator token is removed at the end of each sent. text will be cleaned and tokenized. URLs and namespaces are explicitly preserved as single tokens.
- Parameters
text (unicode) – UTF-8 text to tokenize
include_char_offset (bool) – if True the result is a list of [ list_token_set, list_char_offset_for_token_set ]
dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()
- Returns
list of sents, each itself a list of unigram tokens e.g. [ [ ‘one’,’two’,’three’ ], … ]
- Return type
list
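Example: a minimal sketch (assumes the default English config; two sents in, one token list per sent out, per the Returns description above):
>>> from soton_corenlppy import common_parse_lib
>>> config = common_parse_lib.get_common_config( lang_codes=['en'] )
>>> sents = common_parse_lib.unigram_tokenize_text_with_sent_breakdown( text=u'one two. three four.', dict_common_config=config )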