soton_corenlppy.common_parse_lib module

common parse lib supporting tokenization, POS tagging and sentence management

Stanford POS tagger
license = GPL v2
NLTK support for Python via remote Java exec
English, Arabic, Chinese, French, German
TreeTagger
license = BSD style, free for research/eval/teaching but NOT commercial use (a license must be purchased for that)
NLTK support for Python
German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Chinese, Swahili, Slovak, Latin, Estonian, Polish and Old French
soton_corenlppy.common_parse_lib.check_retweet(original_text)[source]

check for retweets (assumes raw unprocessed text, e.g. ‘RT @username …’)

Parameters

original_text (unicode) – UTF-8 text to check for a retweet pattern

Returns

true if text contains a retweet pattern

Return type

bool
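
A minimal usage sketch (the example text is illustrative):

    from soton_corenlppy import common_parse_lib

    if common_parse_lib.check_retweet( 'RT @username check this out' ) :
        print( 'retweet pattern found' )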

soton_corenlppy.common_parse_lib.clean_text(original_text, dict_common_config, whitespace_chars=None)[source]

clean a block of unicode text ready for tokenization. sequences of whitespace are replaced with a single space. if config[‘lower_tokens’] is True the text is converted to lowercase. if config[‘apostrophe_handling’] is ‘preserve’ then apostrophe entries are preserved (even if the apostrophe is a whitespace character); if config[‘apostrophe_handling’] is ‘strip’ then apostrophe entries are removed

Parameters
  • original_text (unicode) – UTF-8 text to clean

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

  • whitespace_chars (unicode) – whitespace characters. if None the configuration setting will be used in dict_common_config

Returns

clean text

Return type

unicode
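
A minimal usage sketch, assuming an English-only config from get_common_config() (the input text is illustrative):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    str_clean = common_parse_lib.clean_text( 'Some   raw \t text', dict_config )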

soton_corenlppy.common_parse_lib.create_ngram_tokens(list_tokens, max_gram=4, sent_temination_tokens=None)[source]

compile n-gram phrase sets, keeping the linear sequence of tokens intact, up to a maximum gram size. the optional sent_temination_tokens argument prevents n-gram tokens from spanning sentence terminator tokens (e.g. newlines)

Parameters
  • list_tokens (list) – unigram token list

  • max_gram (int) – max gram size to create

  • sent_temination_tokens (list) – list of sent terminator tokens

Returns

set of n-gram tokens e.g. [ [(‘one’,),(‘two’,),(‘three’,),(‘four’,)], [(‘one’,’two’), (‘two’,’three’), (‘three’,’four’)], [(‘one’,’two’,’three’),(‘two’,’three’,’four’)] ]

Return type

list
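
A minimal usage sketch reproducing the return structure shown above:

    from soton_corenlppy import common_parse_lib

    list_ngrams = common_parse_lib.create_ngram_tokens( [ 'one', 'two', 'three', 'four' ], max_gram = 3 )
    # list_ngrams[0] = unigram tuples, list_ngrams[1] = bigrams, list_ngrams[2] = trigrams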

soton_corenlppy.common_parse_lib.create_sent_trees(list_pos, list_sent_addr_offsets=None, dict_common_config=None)[source]

create a set of nltk.Tree structures for sentences. sent delimiter characters are taken from dict_common_config[‘sent_token_seps’] and the period character

Parameters
  • list_pos (list) – POS tagged sentence e.g. [ (‘And’, ‘CC’), (‘now’, ‘RB’), (‘for’, ‘IN’), (‘something’, ‘NN’), (‘completely’, ‘RB’), (‘different’, ‘JJ’) ]

  • list_sent_addr_offsets (list) – list which (if not None) will be populated with the start address of each sent (address within the original POS tagged sent)

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

list of nltk.Tree sentence structures e.g. [ nltk.Tree(S And/CC now/RB for/IN something/NN completely/RB different/JJ), … ]

Return type

list
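
A minimal usage sketch using the example POS tagged sentence above (a trailing period is added to act as a sent delimiter):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    list_pos = [ ('And','CC'), ('now','RB'), ('for','IN'), ('something','NN'), ('completely','RB'), ('different','JJ'), ('.','.') ]
    list_offsets = []
    list_trees = common_parse_lib.create_sent_trees( list_pos, list_sent_addr_offsets = list_offsets, dict_common_config = dict_config )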

soton_corenlppy.common_parse_lib.escape_tagged_token(tuple_pos)[source]

escape open and close brackets in a POS token to make it nltk.Tree safe

Parameters

tuple_pos (tuple) – tuple of tagged POS entry = (token, pos)

Returns

escaped POS token = (token, pos)

Return type

tuple

soton_corenlppy.common_parse_lib.escape_token(token_str)[source]

escape open and close brackets in a token to make it nltk.Tree safe

Parameters

token_str (unicode) – token text to process

Returns

escaped text

Return type

unicode

soton_corenlppy.common_parse_lib.flattern_sent(tree_sent, dict_common_config)[source]

flatten a sent tree so it has no subtrees; each subtree is flattened into a phrase. this is useful for subsequent processing that requires a tagged list, such as dependency parsing

Parameters
  • tree_sent (nltk.Tree) – sentence tree with any number of subtrees (S (CC And) (RB now) (VP (NP New) (NP York)) … )

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

flattened sent tree with phrases for subtrees e.g. (S (CC And) (RB now) (VP New York) … )

Return type

nltk.Tree
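
A minimal usage sketch (the hand-built nltk.Tree mirrors the example above; the leaf layout is illustrative):

    import nltk
    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    tree_sent = nltk.Tree( 'S', [ nltk.Tree( 'CC', ['And'] ), nltk.Tree( 'RB', ['now'] ), nltk.Tree( 'VP', [ nltk.Tree( 'NP', ['New'] ), nltk.Tree( 'NP', ['York'] ) ] ) ] )
    tree_flat = common_parse_lib.flattern_sent( tree_sent, dict_config )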

soton_corenlppy.common_parse_lib.flattern_tree_with_heads(tree)[source]

flatten an nltk.Tree preserving the head node, so all tokens are returned (unlike nltk.Tree.leaves()).

Parameters

tree (nltk.Tree) – tree to flatten

Returns

list of tokens under tree, including the head

Return type

list

soton_corenlppy.common_parse_lib.get_common_config(**kwargs)[source]

return a common config object for this specific set of languages. the config object contains an instantiated NLTK stemmer, tokenizer and settings tailored for the chosen language set. all available language-specific corpora, such as stoplists, will be read into memory. common config settings are below:

  • stemmer = NLTK stemmer, default is no stemming = nltk.stem.RegexpStemmer(‘’, 100000)

  • t_word = NLTK word tokenizer for chosen language. default is nltk.tokenize.treebank.TreebankWordTokenizer()

  • t_sent = NLTK sent tokenizer for chosen language. default is nltk.tokenize.punkt.PunktSentenceTokenizer()

  • regex_namespace = compiled regex object, regex to match namespaces e.g. www.bbc.co.uk

  • regex_url = compiled regex object, regex to match URIs e.g. http://www.bbc.co.uk

  • regex_numeric_extract = compiled regex object, regex to match numeric strings e.g. 56, 56.76, $54.23, $54 but NOT 52.com

  • lang_codes = list, list of ISO 639-1 2 character language codes e.g. [‘en’,’fr’]

  • stoplist = list, aggregated set of stopwords for languages selected

  • logger = logging.Logger, logger object

  • whitespace = str, string containing whitespace characters that will be removed prior to tokenization. default is ‘\u201a\u201b\u201c\u201d’

  • punctuation = str, string containing punctuation characters that will be forced into their own token. default is ,;\/:+-#~&*=!?

  • corpus_dir = str, directory where common_parse_lib language specific corpus files are located

  • max_gram = int, maximum size of n-grams for use in create_ngram_tokens() function. default is 4

  • first_names = set, aggregated language-specific set of first names

  • lower_tokens = bool, if True text will be converted to lower() before tokenization. default is False

  • sent_token_seps = list, unicode sentence termination tokens. default is [‘\n’, ‘\r\n’, ‘\f’, ‘\u2026’]

  • stanford_tagger_dir = base dir for Stanford POS tagger (e.g. c:\stanford-postagger-full)

  • treetagger_tagger_dir = base dir for TreeTagger (e.g. c:\treetagger)

  • lang_pos_mapping = dict, set of language to POS tagger mappings. e.g. { ‘en’ : ‘stanford’, ‘ru’ : ‘treetagger’ }

  • pos_sep = tuple, POS separator character and a safe replacement. the default POS separator char is ‘/’ and usually POS tagged sentences become ‘term/POS term/POS …’. when tagging a token containing this character e.g. ‘one/two’ the POS separator character will be replaced prior to serialization to avoid an ambiguous output.

  • token_preservation_regex = dict mapping key names of compiled regex objects (used to identify tokens that should be preserved) to a unique POS token name (e.g. { ‘regex_namespace’ : ‘NAMESPACE’, ‘regex_url’ : ‘URI’ } ). the POS token name must be unique for the chosen POS tagger and safe for POS serialization, i.e. without characters like ‘ ‘ or ‘/’. this dict argument allows additional POS tokens to be added in the future without the need to change the common_parse_lib code.

note: a config object approach is used, as opposed to a global variable, to allow common_parse_lib functions to work in a multi-threaded environment
Parameters

kwargs – variable argument to override any default config values

Returns

configuration settings to be used by all common_parse_lib functions

Return type

dict
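
A minimal sketch of building a config and overriding a few defaults via kwargs (the chosen values are illustrative; tagger install dirs such as stanford_tagger_dir can be overridden the same way):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config(
        lang_codes = ['en'],
        max_gram = 3,
        lower_tokens = True )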

soton_corenlppy.common_parse_lib.is_all_stoplist(list_tokens, dict_common_config)[source]

check to see if tokens are only stoplist tokens

Parameters
  • list_tokens (list) – list of unigram tokens

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

True if ALL tokens match the stoplist (i.e. token set is useless as a phrase)

Return type

bool
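
A minimal usage sketch (assumes ‘the’ and ‘of’ appear in the aggregated English stoplist):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    if common_parse_lib.is_all_stoplist( [ 'the', 'of' ], dict_config ) :
        print( 'phrase is all stopwords - discard it' )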

soton_corenlppy.common_parse_lib.ngram_tokenize_microblog_text(text, dict_common_config)[source]

tokenize a microblog entry (e.g. a tweet) into all possible combinations of n-gram phrases, keeping the linear sentence structure intact. the text will be cleaned and tokenized. URLs and namespaces are explicitly preserved as single tokens. a set of all possible n-gram tokens is returned, up to the configured max gram size

Parameters
  • text (unicode) – UTF-8 text to tokenize

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

list of n-gram token sets e.g. [ [(‘one’,),(‘two’,),(‘three’,),(‘four’,)], [(‘one’,’two’), (‘two’,’three’), (‘three’,’four’)], [(‘one’,’two’,’three’),(‘two’,’three’,’four’)] ]

Return type

list
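
A minimal usage sketch (the tweet text is illustrative):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    list_ngram_sets = common_parse_lib.ngram_tokenize_microblog_text( 'london underground delays today http://www.bbc.co.uk', dict_config )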

soton_corenlppy.common_parse_lib.parse_serialized_tagged_tree(serialized_tree, dict_common_config)[source]

parse a previously serialized tree. note: tokens are unescaped using the replacement characters defined in list_escape_tuples

Parameters
  • serialized_tree (unicode) – serialized tree structure containing POS tagged leaves from common_parse_lib.serialize_tagged_tree()

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

tree representing POS tagged sentence e.g. (S And/CC now/RB for/IN something/NN completely/RB different/JJ)

Return type

nltk.Tree
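
A minimal round-trip sketch via create_sent_trees() and serialize_tagged_tree() (the POS tagged input is illustrative):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    list_pos = [ ('new','NN'), ('york','NN'), ('.','.') ]
    list_trees = common_parse_lib.create_sent_trees( list_pos, dict_common_config = dict_config )
    str_tree = common_parse_lib.serialize_tagged_tree( list_trees[0], dict_config )
    tree_copy = common_parse_lib.parse_serialized_tagged_tree( str_tree, dict_config )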

soton_corenlppy.common_parse_lib.pos_tag_tokenset(token_set, lang, dict_common_config, timeout=300)[source]

POS tag a batch of tokenized sentences for a specific language. it is more efficient to POS tag in large batches, as the POS tagger is a separate process that must be invoked using an OS exec and a Python subprocess command. there is a fixed overhead for sub-process and pipe setup (e.g. 1-2 seconds), so processing text in bulk is more efficient than many small separate sentences.

note: the POS tagger used is chosen from TreeTagger, Stanford and Treebank based on the language code
note: URLs and namespaces matching regex patterns provided in dict_common_config will get a POS tag of ‘URI’ or ‘NAMESPACE’ regardless of which POS tagger is used
note: tokens matching characters in dict_common_config[‘sent_token_seps’] will be labelled with a POS tag of ‘NEWLINE’
Parameters
  • token_set (list) – list of tokens for a set of sentences. each sentence has a token set, which is itself a list of either tokenized phrase tuples or tokenized phrase strings. e.g. [ [ (‘london’,),(‘attacks’,),(‘are’,) … ], … ] e.g. [ [ ‘london’,’attacks’,’are’, … ], … ]

  • lang (str) – ISO 639-1 2 character language code (e.g. ‘en’)

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

  • timeout (int) – timeout in seconds for POS tagger process in the unlikely event the POS tagger hangs

Returns

list of POS tagged sentences e.g. [ [ (‘And’, ‘CC’), (‘now’, ‘RB’), (‘for’, ‘IN’), (‘something’, ‘NN’), (‘completely’, ‘RB’), (‘different’, ‘JJ’) ], … ]

Return type

list
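
A minimal usage sketch (assumes a POS tagger is installed and mapped for ‘en’ in lang_pos_mapping):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    token_set = [ [ 'london', 'attacks', 'are', 'over' ] ]
    list_tagged = common_parse_lib.pos_tag_tokenset( token_set, 'en', dict_config )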

soton_corenlppy.common_parse_lib.pos_tag_tokenset_batch(document_token_set, lang, dict_common_config, max_processes=4, timeout=300)[source]

POS tag a batch of tokenized sentences for a specific language. it is more efficient to POS tag in large batches, as the POS tagger is a separate process that must be invoked using an OS exec and a Python subprocess command. there is a fixed overhead for sub-process and pipe setup (e.g. 1-2 seconds), so processing text in bulk is more efficient than many small separate sentences. multiprocess spawning is used to maximize CPU usage, as POS tagging is slow and CPU intensive.

note: the POS tagger used is chosen from TreeTagger, Stanford and Treebank based on the language code
note: URLs and namespaces matching regex patterns provided in dict_common_config will get a POS tag of ‘URI’ or ‘NAMESPACE’ regardless of which POS tagger is used
note: tokens matching characters in dict_common_config[‘sent_token_seps’] will be labelled with a POS tag of ‘NEWLINE’
Parameters
  • document_token_set (dict) – { docID : [ token_set for each document sent ] }

  • lang (str) – ISO 639-1 2 character language code (e.g. ‘en’)

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

  • max_processes (int) – number of worker processes to spawn using multiprocessing.Process

  • timeout (int) – timeout in seconds for POS tagger process in the unlikely event the POS tagger hangs

Returns

dict of POS tagged documents { docID : [ tagged_token_set for each document sent ] }

Return type

dict
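
A minimal usage sketch (docIDs and text are illustrative; assumes a POS tagger is installed and mapped for ‘en’):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    dict_docs = {
        'doc1' : [ [ 'london', 'attacks' ] ],
        'doc2' : [ [ 'more', 'news', 'today' ] ] }
    dict_tagged = common_parse_lib.pos_tag_tokenset_batch( dict_docs, 'en', dict_config, max_processes = 2 )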

soton_corenlppy.common_parse_lib.pos_tag_tokenset_batch_worker(tuple_queue=None, lang='en', dict_common_config=None, pause_on_start=0, timeout=300, process_id=0)[source]

worker process for common_parse_lib.pos_tag_tokenset_batch()

Parameters
  • tuple_queue (tuple) – tuple of queue (queueIn, queueOut, queueError). queueIn has tuples of ( doc_id, [ token_set for each document sent ] ). queueOut has tuples of ( doc_id, [ tagged_token_set for each document sent ] ).

  • lang (str) – ISO 639-1 2 character language code (e.g. ‘en’)

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

  • pause_on_start (int) – number of seconds to delay thread startup before CPU intensive work begins (to allow other workers to startup also)

  • timeout (int) – timeout in seconds for POS tagger process in the unlikely event the POS tagger hangs

  • process_id (int) – ID of process for logging purposes

soton_corenlppy.common_parse_lib.read_pipe_stderr(pipe_handle, queue_buffer)[source]

internal POS tagger process pipe callback function

Parameters
  • pipe_handle (file) – pipe handle to POS tagger errors

  • queue_buffer (Queue.Queue()) – queue where pipe errors can be stored

soton_corenlppy.common_parse_lib.read_pipe_stdout(pipe_handle, queue_buffer, lines_expected=1)[source]

internal POS tagger process pipe callback function

Parameters
  • pipe_handle (file) – pipe handle to POS tagger output

  • queue_buffer (Queue.Queue()) – queue where pipe output can be stored

  • lines_expected (int) – number of lines expected so we do not read other sentences from pipe

soton_corenlppy.common_parse_lib.serialize_tagged_list(list_pos, dict_common_config, serialization_style='pos')[source]

serialize POS tagged tokens (list). note: the POS separator (e.g. ‘/’) is replaced in all tokens and POS tags, so it is always safe to use as a separator in the serialization

Parameters
  • list_pos (list) – POS tagged sentence e.g. [ (‘And’, ‘CC’), (‘now’, ‘RB’), (‘for’, ‘IN’), (‘something’, ‘NN’), (‘completely’, ‘RB’), (‘different’, ‘JJ’) ]

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

  • serialization_style (str) – either POS tag list style (pos) or sentence tree style (tree). pos style is ‘and/CC now/RB …’. tree style is ‘(CC and) (RB now) …’

Returns

serialized POS tagged sentence in style requested e.g. ‘new/NN york/NN’

Return type

unicode
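
A minimal usage sketch showing both serialization styles (expected outputs follow the style descriptions above):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    list_pos = [ ('new','NN'), ('york','NN') ]
    str_pos = common_parse_lib.serialize_tagged_list( list_pos, dict_config, serialization_style = 'pos' )
    # e.g. 'new/NN york/NN'
    str_tree = common_parse_lib.serialize_tagged_list( list_pos, dict_config, serialization_style = 'tree' )
    # e.g. '(NN new) (NN york)'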

soton_corenlppy.common_parse_lib.serialize_tagged_tree(tree_sent, dict_common_config)[source]

serialize POS tagged tokens (tree). this function recurses if the tree has one or more subtrees. note: all tokens are escaped using escape_token()

Parameters
  • tree_sent (nltk.Tree) – POS tagged sentence e.g. (S And/CC now/RB for/IN something/NN completely/RB different/JJ) or (S (CC And) (RB now) … )

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

serialized POS tagged sentence e.g. ‘(S new/NN york/NN -LRB-man-made location-RRB-/PARENTHETICAL_MATERIAL)’ or ‘(S (NP New) (NP York) (PARENTHETICAL_MATERIAL -LRB-man-made location-RRB-))’

Return type

unicode

soton_corenlppy.common_parse_lib.tokenize_sentence(sent, dict_common_config)[source]

tokenizes a single sentence into stemmed tokens. if nltk.tokenize.treebank.TreebankWordTokenizer is used, tokens will be corrected for embedded punctuation and embedded periods within tokens, unless they are numeric values

Parameters
  • sent (unicode) – UTF-8 text sentence to tokenize

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

list of unigram tokens e.g. [ ‘one’,’two’,’three’ ]

Return type

list
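
A minimal usage sketch (the exact tokens returned depend on the configured tokenizer and stemmer):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    list_tokens = common_parse_lib.tokenize_sentence( 'New York is large.', dict_config )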

soton_corenlppy.common_parse_lib.unescape_tagged_token(tuple_pos)[source]

unescape open and close brackets in a POS token

Parameters

tuple_pos (tuple) – tuple of tagged POS entry = (token, pos)

Returns

unescaped POS token = (token, pos)

Return type

tuple

soton_corenlppy.common_parse_lib.unescape_token(token_str)[source]

unescape open and close brackets in a token

Parameters

token_str (unicode) – token text to process

Returns

unescaped text

Return type

unicode

soton_corenlppy.common_parse_lib.unescape_tree(tree, depth=0)[source]

unescape a nltk.Tree open and close brackets

Parameters
  • tree (nltk.Tree) – tree to process

  • depth (int) – recursion depth (internal variable)

Returns

unescaped tree

Return type

nltk.Tree

soton_corenlppy.common_parse_lib.unigram_tokenize_text(text=None, include_char_offset=False, dict_common_config=None)[source]

tokenize a text entry (e.g. a tweet) into unigram tokens. the text will be cleaned and tokenized. URLs and namespaces are explicitly preserved as single tokens.

Parameters
  • text (unicode) – UTF-8 text to tokenize

  • include_char_offset (bool) – if True the result is a list of [ list_tokens, list_char_offset_for_tokens ]

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

list of unigram tokens e.g. [ ‘one’,’two’,’three’ ]

Return type

list
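
A minimal usage sketch (the URL-bearing text is illustrative):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    list_tokens = common_parse_lib.unigram_tokenize_text( text = 'read http://www.bbc.co.uk for more', dict_common_config = dict_config )
    # the URL survives as a single token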

soton_corenlppy.common_parse_lib.unigram_tokenize_text_with_sent_breakdown(text=None, include_char_offset=False, dict_common_config=None)[source]

tokenize a microblog entry (e.g. a tweet) into unigram tokens broken down into individual sents, with the sent terminator token removed at the end of each sent. the text will be cleaned and tokenized. URLs and namespaces are explicitly preserved as single tokens.

Parameters
  • text (unicode) – UTF-8 text to tokenize

  • include_char_offset (bool) – if True the result is a list of [ list_token_set, list_char_offset_for_token_set ]

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

list of sents, each itself a list of unigram tokens e.g. [ [ ‘one’,’two’,’three’ ], … ]

Return type

list
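
A minimal usage sketch (two sents in the input text; each sent's terminator token is removed from its token list):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    list_sents = common_parse_lib.unigram_tokenize_text_with_sent_breakdown( text = 'first sent here. second sent here.', dict_common_config = dict_config )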