soton_corenlppy.common_parse_lib module

common parse lib supporting tokenization, POS tagging and sentence management

Stanford POS tagger
license = GPL v2
NLTK support for Python via remote Java exec
English, Arabic, Chinese, French, German
TreeTagger
license = BSD style, free for research/eval/teaching but NOT commercial use (a license must be purchased for that)
NLTK support for Python
German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Chinese, Swahili, Slovak, Latin, Estonian, Polish and Old French
soton_corenlppy.common_parse_lib.check_retweet(original_text)[source]

check for retweets (assumes raw unprocessed text, e.g. ‘RT @username …’)

Parameters

original_text (unicode) – UTF-8 text to check for a retweet pattern

Returns

true if text contains a retweet pattern

Return type

bool
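
A minimal usage sketch (the example text is illustrative):

    from soton_corenlppy import common_parse_lib

    if common_parse_lib.check_retweet( 'RT @username check this out' ) :
        print( 'retweet pattern found' )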

soton_corenlppy.common_parse_lib.clean_text(original_text, dict_common_config, whitespace_chars=None)[source]

clean a block of unicode text ready for tokenization. sequences of whitespace are replaced with a single space. if config[‘lower_tokens’] is True the text is converted to lowercase. if config[‘apostrophe_handling’] is ‘preserve’ then apostrophe entries are preserved (even if the apostrophe is a whitespace character); if config[‘apostrophe_handling’] is ‘strip’ then apostrophe entries are removed

Parameters
  • original_text (unicode) – UTF-8 text to clean

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

  • whitespace_chars (unicode) – whitespace characters. if None the configuration setting will be used in dict_common_config

Returns

clean text

Return type

unicode
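
A minimal usage sketch, assuming an English-only config from get_common_config() (the input text is illustrative):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    str_clean = common_parse_lib.clean_text( 'Some   raw \t text', dict_config )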

soton_corenlppy.common_parse_lib.create_ngram_tokens(list_tokens, max_gram=4, sent_temination_tokens=None)[source]

compile n-gram phrase sets, keeping the linear sequence of tokens intact, up to a maximum gram size. the optional sent_temination_tokens argument prevents n-gram tokens from spanning sentence terminator tokens (e.g. newlines)

Parameters
  • list_tokens (list) – unigram token list

  • max_gram (int) – max gram size to create

  • sent_temination_tokens (list) – list of sent terminator tokens

Returns

set of n-gram tokens e.g. [ [(‘one’,),(‘two’,),(‘three’,),(‘four’,)], [(‘one’,’two’), (‘two’,’three’), (‘three’,’four’)], [(‘one’,’two’,’three’),(‘two’,’three’,’four’)] ]

Return type

list
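
A minimal usage sketch reproducing the return structure shown above:

    from soton_corenlppy import common_parse_lib

    list_ngrams = common_parse_lib.create_ngram_tokens( [ 'one', 'two', 'three', 'four' ], max_gram = 3 )
    # list_ngrams[0] = unigram tuples, list_ngrams[1] = bigrams, list_ngrams[2] = trigrams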

soton_corenlppy.common_parse_lib.create_sent_trees(list_pos, list_sent_addr_offsets=None, dict_common_config=None)[source]

create a set of nltk.Tree structures for sentences. sent delimiter characters are taken from dict_common_config[‘sent_token_seps’] and the period character

Parameters
  • list_pos (list) – POS tagged sentence e.g. [ (‘And’, ‘CC’), (‘now’, ‘RB’), (‘for’, ‘IN’), (‘something’, ‘NN’), (‘completely’, ‘RB’), (‘different’, ‘JJ’) ]

  • list_sent_addr_offsets (list) – list which (if not None) will be populated with the start address of each sent (address within the original POS tagged sent)

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

list of nltk.Tree sentence structures e.g. [ nltk.Tree(S And/CC now/RB for/IN something/NN completely/RB different/JJ), … ]

Return type

list
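
A minimal usage sketch using the example POS tagged sentence above (a trailing period is added to act as a sent delimiter):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    list_pos = [ ('And','CC'), ('now','RB'), ('for','IN'), ('something','NN'), ('completely','RB'), ('different','JJ'), ('.','.') ]
    list_offsets = []
    list_trees = common_parse_lib.create_sent_trees( list_pos, list_sent_addr_offsets = list_offsets, dict_common_config = dict_config )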

soton_corenlppy.common_parse_lib.escape_tagged_token(tuple_pos)[source]

escape open and close brackets in a POS token to make it nltk.Tree safe

Parameters

tuple_pos (tuple) – tuple of tagged POS entry = (token, pos)

Returns

escaped POS token = (token, pos)

Return type

tuple

soton_corenlppy.common_parse_lib.escape_token(token_str)[source]

escape open and close brackets in a token to make it nltk.Tree safe

Parameters

token_str (unicode) – token text to process

Returns

escaped text

Return type

unicode

soton_corenlppy.common_parse_lib.flattern_sent(tree_sent, dict_common_config)[source]

flatten a sent tree so it has no subtrees; each subtree is flattened into a phrase. this is useful for subsequent processing that requires a tagged list, such as dependency parsing

Parameters
  • tree_sent (nltk.Tree) – sentence tree with any number of subtrees (S (CC And) (RB now) (VP (NP New) (NP York)) … )

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

flattened sent tree with phrases for subtrees e.g. (S (CC And) (RB now) (VP New York) … )

Return type

nltk.Tree
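
A minimal usage sketch (the hand-built nltk.Tree mirrors the example above; the leaf layout is illustrative):

    import nltk
    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    tree_sent = nltk.Tree( 'S', [ nltk.Tree( 'CC', ['And'] ), nltk.Tree( 'RB', ['now'] ), nltk.Tree( 'VP', [ nltk.Tree( 'NP', ['New'] ), nltk.Tree( 'NP', ['York'] ) ] ) ] )
    tree_flat = common_parse_lib.flattern_sent( tree_sent, dict_config )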

soton_corenlppy.common_parse_lib.flattern_tree_with_heads(tree)[source]

flatten an nltk.Tree preserving the head node, so all tokens are returned (unlike nltk.Tree.leaves()).

Parameters

tree (nltk.Tree) – tree to flatten

Returns

list of tokens under tree, including the head

Return type

list

soton_corenlppy.common_parse_lib.get_common_config(**kwargs)[source]

return a common config object for this specific set of languages. the config object contains an instantiated NLTK stemmer, tokenizer and settings tailored for the chosen language set. all available language-specific corpora, such as stoplists, will be read into memory. common config settings are below:

  • stemmer = NLTK stemmer, default is no stemming = nltk.stem.RegexpStemmer(‘’, 100000)

  • t_word = NLTK word tokenizer for chosen language. default is nltk.tokenize.treebank.TreebankWordTokenizer()

  • t_sent = NLTK sent tokenizer for chosen language. default is nltk.tokenize.punkt.PunktSentenceTokenizer()

  • regex_namespace = compiled regex object, regex to match namespaces e.g. www.bbc.co.uk

  • regex_url = compiled regex object, regex to match URIs e.g. http://www.bbc.co.uk

  • regex_numeric_extract = compiled regex object, regex to match numeric strings e.g. 56, 56.76, $54.23, $54 but NOT 52.com

  • lang_codes = list, list of ISO 639-1 2 character language codes e.g. [‘en’,’fr’]

  • stoplist = list, aggregated set of stopwords for languages selected

  • logger = logging.Logger, logger object

  • whitespace = str, string containing whitespace characters that will be removed prior to tokenization. default is ‘\u201a\u201b\u201c\u201d’

  • punctuation = str, string containing punctuation characters that will be forced into their own token. default is ,;\/:+-#~&*=!?

  • corpus_dir = str, directory where common_parse_lib language specific corpus files are located

  • max_gram = int, maximum size of n-grams for use in create_ngram_tokens() function. default is 4

  • first_names = set, aggregated language-specific set of first names

  • lower_tokens = bool, if True text will be converted to lower() before tokenization. default is False

  • sent_token_seps = list, unicode sentence termination tokens. default is [‘\n’, ‘\r\n’, ‘\f’, ‘\u2026’]

  • stanford_tagger_dir = base dir for Stanford POS tagger (e.g. c:\stanford-postagger-full)

  • treetagger_tagger_dir = base dir for TreeTagger (e.g. c:\treetagger)

  • lang_pos_mapping = dict, set of language to POS tagger mappings. e.g. { ‘en’ : ‘stanford’, ‘ru’ : ‘treetagger’ }

  • pos_sep = tuple, POS separator character and a safe replacement. the default POS separator char is ‘/’ and usually POS tagged sentences become ‘term/POS term/POS …’. when tagging a token containing this character e.g. ‘one/two’ the POS separator character will be replaced prior to serialization to avoid an ambiguous output.

  • token_preservation_regex = dict mapping key names of compiled regex objects (used to identify tokens that should be preserved) to a unique POS token name (e.g. { ‘regex_namespace’ : ‘NAMESPACE’, ‘regex_url’ : ‘URI’ } ). the POS token name must be unique for the chosen POS tagger and safe for POS serialization, i.e. without characters like ‘ ‘ or ‘/’. this dict argument allows additional POS tokens to be added in the future without the need to change the common_parse_lib code.

note: a config object approach is used, as opposed to a global variable, to allow common_parse_lib functions to work in a multi-threaded environment
Parameters

kwargs – variable argument to override any default config values

Returns

configuration settings to be used by all common_parse_lib functions

Return type

dict
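
A minimal sketch of building a config and overriding a few defaults via kwargs (the chosen values are illustrative; tagger install dirs such as stanford_tagger_dir can be overridden the same way):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config(
        lang_codes = ['en'],
        max_gram = 3,
        lower_tokens = True )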

soton_corenlppy.common_parse_lib.is_all_stoplist(list_tokens, dict_common_config)[source]

check to see if tokens are only stoplist tokens

Parameters
  • list_tokens (list) – list of unigram tokens

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

True if ALL tokens match the stoplist (i.e. token set is useless as a phrase)

Return type

bool
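
A minimal usage sketch (assumes ‘the’ and ‘of’ appear in the aggregated English stoplist):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    if common_parse_lib.is_all_stoplist( [ 'the', 'of' ], dict_config ) :
        print( 'phrase is all stopwords - discard it' )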

soton_corenlppy.common_parse_lib.ngram_tokenize_microblog_text(text, dict_common_config)[source]

tokenize a microblog entry (e.g. a tweet) into all possible combinations of n-gram phrases, keeping the linear sentence structure intact. the text will be cleaned and tokenized. URLs and namespaces are explicitly preserved as single tokens. a set of all possible n-gram tokens is returned, up to the configured max gram size

Parameters
  • text (unicode) – UTF-8 text to tokenize

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

list of n-gram token sets e.g. [ [(‘one’,),(‘two’,),(‘three’,),(‘four’,)], [(‘one’,’two’), (‘two’,’three’), (‘three’,’four’)], [(‘one’,’two’,’three’),(‘two’,’three’,’four’)] ]

Return type

list
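
A minimal usage sketch (the tweet text is illustrative):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    list_ngram_sets = common_parse_lib.ngram_tokenize_microblog_text( 'london underground delays today http://www.bbc.co.uk', dict_config )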

soton_corenlppy.common_parse_lib.parse_serialized_tagged_tree(serialized_tree, dict_common_config)[source]

parse a previously serialized tree. note: tokens are unescaped using the replacement characters defined in list_escape_tuples

Parameters
  • serialized_tree (unicode) – serialized tree structure containing POS tagged leaves from common_parse_lib.serialize_tagged_tree()

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

tree representing POS tagged sentence e.g. (S And/CC now/RB for/IN something/NN completely/RB different/JJ)

Return type

nltk.Tree
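
A minimal round-trip sketch via create_sent_trees() and serialize_tagged_tree() (the POS tagged input is illustrative):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    list_pos = [ ('new','NN'), ('york','NN'), ('.','.') ]
    list_trees = common_parse_lib.create_sent_trees( list_pos, dict_common_config = dict_config )
    str_tree = common_parse_lib.serialize_tagged_tree( list_trees[0], dict_config )
    tree_copy = common_parse_lib.parse_serialized_tagged_tree( str_tree, dict_config )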

soton_corenlppy.common_parse_lib.pos_tag_tokenset(token_set, lang, dict_common_config, timeout=300)[source]

POS tag a batch of tokenized sentences for a specific language. it is more efficient to POS tag in large batches, as the POS tagger is a separate process that must be invoked using an OS exec and a Python subprocess command. there is a fixed overhead for sub-process and pipe setup (e.g. 1-2 seconds), so processing text in bulk is more efficient than many small separate sentences.

note: the POS tagger used is chosen from TreeTagger, Stanford and Treebank based on the language code
note: URLs and namespaces matching regex patterns provided in dict_common_config will get a POS tag of ‘URI’ or ‘NAMESPACE’ regardless of which POS tagger is used
note: tokens matching characters in dict_common_config[‘sent_token_seps’] will be labelled with a POS tag of ‘NEWLINE’
Parameters
  • token_set (list) – list of tokens for a set of sentences. each sentence has a token set, which is itself a list of either tokenized phrase tuples or tokenized phrase strings. e.g. [ [ (‘london’,),(‘attacks’,),(‘are’,) … ], … ] e.g. [ [ ‘london’,’attacks’,’are’, … ], … ]

  • lang (str) – ISO 639-1 2 character language code (e.g. ‘en’)

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

  • timeout (int) – timeout in seconds for POS tagger process in the unlikely event the POS tagger hangs

Returns

list of POS tagged sentences e.g. [ [ (‘And’, ‘CC’), (‘now’, ‘RB’), (‘for’, ‘IN’), (‘something’, ‘NN’), (‘completely’, ‘RB’), (‘different’, ‘JJ’) ], … ]

Return type

list
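
A minimal usage sketch (assumes a POS tagger is installed and mapped for ‘en’ in lang_pos_mapping):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    token_set = [ [ 'london', 'attacks', 'are', 'over' ] ]
    list_tagged = common_parse_lib.pos_tag_tokenset( token_set, 'en', dict_config )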

soton_corenlppy.common_parse_lib.pos_tag_tokenset_batch(document_token_set, lang, dict_common_config, max_processes=4, timeout=300)[source]

POS tag a batch of tokenized sentences for a specific language. it is more efficient to POS tag in large batches, as the POS tagger is a separate process that must be invoked using an OS exec and a Python subprocess command. there is a fixed overhead for sub-process and pipe setup (e.g. 1-2 seconds), so processing text in bulk is more efficient than many small separate sentences. multiprocess spawning is used to maximize CPU usage, as POS tagging is slow and CPU intensive.

note: the POS tagger used is chosen from TreeTagger, Stanford and Treebank based on the language code
note: URLs and namespaces matching regex patterns provided in dict_common_config will get a POS tag of ‘URI’ or ‘NAMESPACE’ regardless of which POS tagger is used
note: tokens matching characters in dict_common_config[‘sent_token_seps’] will be labelled with a POS tag of ‘NEWLINE’
Parameters
  • document_token_set (dict) – { docID : [ token_set for each document sent ] }

  • lang (str) – ISO 639-1 2 character language code (e.g. ‘en’)

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

  • max_processes (int) – number of worker processes to spawn using multiprocessing.Process

  • timeout (int) – timeout in seconds for POS tagger process in the unlikely event the POS tagger hangs

Returns

dict of POS tagged documents { docID : [ tagged_token_set for each document sent ] }

Return type

dict
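
A minimal usage sketch (docIDs and text are illustrative; assumes a POS tagger is installed and mapped for ‘en’):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    dict_docs = {
        'doc1' : [ [ 'london', 'attacks' ] ],
        'doc2' : [ [ 'more', 'news', 'today' ] ] }
    dict_tagged = common_parse_lib.pos_tag_tokenset_batch( dict_docs, 'en', dict_config, max_processes = 2 )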

soton_corenlppy.common_parse_lib.pos_tag_tokenset_batch_worker(tuple_queue=None, lang='en', dict_common_config=None, pause_on_start=0, timeout=300, process_id=0)[source]

worker process for common_parse_lib.pos_tag_tokenset_batch()

Parameters
  • tuple_queue (tuple) – tuple of queue (queueIn, queueOut, queueError). queueIn has tuples of ( doc_id, [ token_set for each document sent ] ). queueOut has tuples of ( doc_id, [ tagged_token_set for each document sent ] ).

  • lang (str) – ISO 639-1 2 character language code (e.g. ‘en’)

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

  • pause_on_start (int) – number of seconds to delay thread startup before CPU intensive work begins (to allow other workers to startup also)

  • timeout (int) – timeout in seconds for POS tagger process in the unlikely event the POS tagger hangs

  • process_id (int) – ID of process for logging purposes

soton_corenlppy.common_parse_lib.read_pipe_stderr(pipe_handle, queue_buffer)[source]

internal POS tagger process pipe callback function

Parameters
  • pipe_handle (file) – pipe handle to POS tagger errors

  • queue_buffer (Queue.Queue()) – queue where pipe errors can be stored

soton_corenlppy.common_parse_lib.read_pipe_stdout(pipe_handle, queue_buffer, lines_expected=1)[source]

internal POS tagger process pipe callback function

Parameters
  • pipe_handle (file) – pipe handle to POS tagger output

  • queue_buffer (Queue.Queue()) – queue where pipe output can be stored

  • lines_expected (int) – number of lines expected so we do not read other sentences from pipe

soton_corenlppy.common_parse_lib.serialize_tagged_list(list_pos, dict_common_config, serialization_style='pos')[source]

serialize POS tagged tokens (list). note: the POS separator (e.g. ‘/’) is replaced in all tokens and POS tags, so it is always safe to use as a separator in the serialization

Parameters
  • list_pos (list) – POS tagged sentence e.g. [ (‘And’, ‘CC’), (‘now’, ‘RB’), (‘for’, ‘IN’), (‘something’, ‘NN’), (‘completely’, ‘RB’), (‘different’, ‘JJ’) ]

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

  • serialization_style (str) – either POS tag list style (pos) or sentence tree style (tree). pos style is ‘and/CC now/RB …’. tree style is ‘(CC and) (RB now) …’

Returns

serialized POS tagged sentence in style requested e.g. ‘new/NN york/NN’

Return type

unicode
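
A minimal usage sketch showing both serialization styles (expected outputs follow the style descriptions above):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    list_pos = [ ('new','NN'), ('york','NN') ]
    str_pos = common_parse_lib.serialize_tagged_list( list_pos, dict_config, serialization_style = 'pos' )
    # e.g. 'new/NN york/NN'
    str_tree = common_parse_lib.serialize_tagged_list( list_pos, dict_config, serialization_style = 'tree' )
    # e.g. '(NN new) (NN york)'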

soton_corenlppy.common_parse_lib.serialize_tagged_tree(tree_sent, dict_common_config)[source]

serialize POS tagged tokens (tree). this function recurses if the tree has one or more subtrees. note: all tokens are escaped using escape_token()

Parameters
  • tree_sent (nltk.Tree) – POS tagged sentence e.g. (S And/CC now/RB for/IN something/NN completely/RB different/JJ) or (S (CC And) (RB now) … )

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

serialized POS tagged sentence e.g. ‘(S new/NN york/NN -LRB-man-made location-RRB-/PARENTHETICAL_MATERIAL)’ or ‘(S (NP New) (NP York) (PARENTHETICAL_MATERIAL -LRB-man-made location-RRB-))’

Return type

unicode

soton_corenlppy.common_parse_lib.tokenize_sentence(sent, dict_common_config)[source]

tokenizes a single sentence into stemmed tokens. if nltk.tokenize.treebank.TreebankWordTokenizer is used, tokens will be corrected for embedded punctuation and embedded periods within tokens, unless they are numeric values

Parameters
  • sent (unicode) – UTF-8 text sentence to tokenize

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

list of unigram tokens e.g. [ ‘one’,’two’,’three’ ]

Return type

list
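
A minimal usage sketch (the exact tokens returned depend on the configured tokenizer and stemmer):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    list_tokens = common_parse_lib.tokenize_sentence( 'New York is large.', dict_config )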

soton_corenlppy.common_parse_lib.unescape_tagged_token(tuple_pos)[source]

unescape open and close brackets in a POS token

Parameters

tuple_pos (tuple) – tuple of tagged POS entry = (token, pos)

Returns

unescaped POS token = (token, pos)

Return type

tuple

soton_corenlppy.common_parse_lib.unescape_token(token_str)[source]

unescape open and close brackets in a token

Parameters

token_str (unicode) – token text to process

Returns

unescaped text

Return type

unicode

soton_corenlppy.common_parse_lib.unescape_tree(tree, depth=0)[source]

unescape a nltk.Tree open and close brackets

Parameters
  • tree (nltk.Tree) – tree to process

  • depth (int) – recursion depth (internal variable)

Returns

unescaped tree

Return type

nltk.Tree

soton_corenlppy.common_parse_lib.unigram_tokenize_text(text=None, include_char_offset=False, dict_common_config=None)[source]

tokenize a text entry (e.g. a tweet) into unigram tokens. the text will be cleaned and tokenized. URLs and namespaces are explicitly preserved as single tokens.

Parameters
  • text (unicode) – UTF-8 text to tokenize

  • include_char_offset (bool) – if True the result is a list of [ list_tokens, list_char_offset_for_tokens ]

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

list of unigram tokens e.g. [ ‘one’,’two’,’three’ ]

Return type

list
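
A minimal usage sketch (the URL-bearing text is illustrative):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    list_tokens = common_parse_lib.unigram_tokenize_text( text = 'read http://www.bbc.co.uk for more', dict_common_config = dict_config )
    # the URL survives as a single token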

soton_corenlppy.common_parse_lib.unigram_tokenize_text_with_sent_breakdown(text=None, include_char_offset=False, dict_common_config=None)[source]

tokenize a microblog entry (e.g. a tweet) into unigram tokens broken down into individual sents, with the sent terminator token removed at the end of each sent. the text will be cleaned and tokenized. URLs and namespaces are explicitly preserved as single tokens.

Parameters
  • text (unicode) – UTF-8 text to tokenize

  • include_char_offset (bool) – if True the result is a list of [ list_token_set, list_char_offset_for_token_set ]

  • dict_common_config (dict) – config object returned from common_parse_lib.get_common_config()

Returns

list of sents, each itself a list of unigram tokens e.g. [ [ ‘one’,’two’,’three’ ], … ]

Return type

list
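
A minimal usage sketch (two sents in the input text; each sent's terminator token is removed from its token list):

    from soton_corenlppy import common_parse_lib

    dict_config = common_parse_lib.get_common_config( lang_codes = ['en'] )
    list_sents = common_parse_lib.unigram_tokenize_text_with_sent_breakdown( text = 'first sent here. second sent here.', dict_common_config = dict_config )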