caterpillar.processing.analysis package¶
caterpillar.processing.analysis.analyse module¶
Tools to perform analysis of text streams (aka tokenizing and filtering).
- class caterpillar.processing.analysis.analyse.Analyser¶
Bases: object
Abstract base class for an analyser.
All analysers are a combination of a tokenizer and zero or more filters. This class accesses the tokenizer by calling self.get_tokenizer() and the filters via self.get_filters(); subclasses must implement get_tokenizer() at a minimum.
This class also defines the analyse() method, which calls the tokenizer and then each filter in order before returning the resulting tokens.
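For example, a minimal custom analyser might look like the following sketch (the LowercaseAnalyser name and its tokenizer/filter choices are illustrative, not part of the library):
from caterpillar.processing.analysis.analyse import Analyser
from caterpillar.processing.analysis.filter import LowercaseFilter
from caterpillar.processing.analysis.tokenize import WordTokenizer

class LowercaseAnalyser(Analyser):
    """Hypothetical analyser: word tokens, lowercased."""

    def __init__(self):
        self._tokenizer = WordTokenizer()
        self._filters = [LowercaseFilter()]

    def get_tokenizer(self):
        # Required: the tokenizer used to split the input text.
        return self._tokenizer

    def get_filters(self):
        # Optional: filters applied, in order, to the token stream.
        return self._filters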
- class caterpillar.processing.analysis.analyse.BiGramAnalyser(bi_grams, stopword_list=None)¶
Bases: caterpillar.processing.analysis.analyse.Analyser
A bi-gram Analyser that behaves exactly like the DefaultAnalyser except it also makes use of a BiGramFilter.
This analyser uses a WordTokenizer in combination with a StopFilter, PositionalLowercaseWordFilter and a BiGramFilter.
Required Arguments: bi_grams – A list of string n-grams to match. Passed directly to BiGramFilter.
Optional Arguments: stopword_list – A list of stop words to override the default English one.
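A brief usage sketch (the bi-gram list and input text are illustrative):
analyser = BiGramAnalyser(["climate change"])
for token in analyser.analyse("they discussed climate change at length"):
    print(token.value)  # "climate change" arrives as a single bi-gram token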
- class caterpillar.processing.analysis.analyse.DefaultAnalyser(stopword_list=None)¶
Bases: caterpillar.processing.analysis.analyse.Analyser
The default caterpillar Analyser which mostly splits on whitespace and punctuation, except for a few special cases, and removes stopwords.
This analyser uses a WordTokenizer in combination with a StopFilter and a PositionalLowercaseWordFilter.
Optional Arguments: stopword_list – A list of stop words to override the default stopwords.ENGLISH one.
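A brief usage sketch (the custom stopword list is illustrative):
analyser = DefaultAnalyser()  # uses the default stopwords.ENGLISH list
custom = DefaultAnalyser(stopword_list=["the", "an"])
values = [token.value for token in custom.analyse("The fox jumped over an old fence")]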
- class caterpillar.processing.analysis.analyse.EverythingAnalyser¶
Bases: caterpillar.processing.analysis.analyse.Analyser
An EverythingAnalyser just returns the entire input string as a single token.
- class caterpillar.processing.analysis.analyse.PotentialBiGramAnalyser¶
Bases: caterpillar.processing.analysis.analyse.Analyser
A PotentialBiGramAnalyser returns a list of possible bi-grams from a stream.
This analyser uses a WordTokenizer in combination with a StopFilter, PositionalLowercaseWordFilter and a PotentialBiGramFilter to generate a stream of possible bi-grams.
caterpillar.processing.analysis.filter module¶
Tools for filtering tokens.
- class caterpillar.processing.analysis.filter.BiGramFilter(bi_grams)¶
Bases: caterpillar.processing.analysis.filter.Filter
Identifies bi-grams in a token stream from a given list.
>>> rext = RegexpTokenizer(r"\w+")
>>> stream = rext("this is a bigram")
>>> bigrams = BiGramFilter(["a bigram"])
>>> [token.value for token in bigrams.filter(stream)]
["this", "is", "a bigram"]
Required Arguments: bi_grams – A list of n-gram strings to match.
- class caterpillar.processing.analysis.filter.Filter¶
Bases: object
Base class for Filter objects.
A Filter subclass must implement a filter() method that takes a single argument (an iterator of Token objects) and yields Token objects in return.
- filter(tokens)¶
Return filtered tokens from the tokens iterator.
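A minimal sketch of a custom filter, assuming (as the built-in filters suggest) that a token's value can be reassigned in place; the UppercaseFilter name is hypothetical:
from caterpillar.processing.analysis.filter import Filter

class UppercaseFilter(Filter):
    """Hypothetical filter that uppercases each token's value."""

    def filter(self, tokens):
        for token in tokens:
            # Tokens are reused by tokenizers (see the Token class below),
            # so mutate and yield; never store the token object itself.
            token.value = token.value.upper()
            yield token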
- class caterpillar.processing.analysis.filter.LowercaseFilter¶
Bases: caterpillar.processing.analysis.filter.Filter
Uses unicode.lower() to lowercase token text.
>>> rext = RegexpTokenizer(r"\w+")
>>> stream = rext("This is a TEST")
>>> [token.value for token in LowercaseFilter().filter(stream)]
["this", "is", "a", "test"]
- class caterpillar.processing.analysis.filter.PassFilter¶
Bases: caterpillar.processing.analysis.filter.Filter
An identity filter: passes the tokens through untouched.
- class caterpillar.processing.analysis.filter.PositionalLowercaseWordFilter(position)¶
Bases: caterpillar.processing.analysis.filter.Filter
Uses unicode.lower() to lowercase single-word tokens that are title-case and appear at a given position.
The filter is only applied if the token contains a single word (i.e. no spaces) and unicode.istitle() returns True. Used internally to force a word at the start of a sentence to lowercase when it isn't part of a compound name.
>>> rext = RegexpTokenizer(r"\w+")
>>> stream = rext("This is a TEST")
>>> [token.value for token in PositionalLowercaseWordFilter(0).filter(stream)]
["this", "is", "a", "TEST"]
Required Arguments: position – A 0-based int index that a token must occupy for this filter to apply.
- class caterpillar.processing.analysis.filter.PotentialBiGramFilter¶
Bases: caterpillar.processing.analysis.filter.Filter
Identifies bi-grams in a token stream along with regular tokens.
Potential bi-grams won’t include stopped tokens or names.
WARNING: unlike most other filters, this filter yields lists of Token objects rather than single tokens. This is done purely for performance. A token that is not part of a bi-gram is yielded as a single-element list.
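Because each yielded item is a list, consumers need an extra level of iteration; a sketch (the input text is illustrative):
stream = WordTokenizer()("the quick brown fox")
for token_list in PotentialBiGramFilter().filter(stream):
    for token in token_list:  # each item is a list of one or more tokens
        print(token.value)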
- class caterpillar.processing.analysis.filter.StopFilter(stoplist, minsize=3)¶
Bases: caterpillar.processing.analysis.filter.Filter
Marks “stop” words (words too common to index) in the stream.
>>> rext = RegexpTokenizer(r"\w+")
>>> stream = rext("this is a test")
>>> stopper = StopFilter(["is", "a"])
>>> [token.value for token in stopper.filter(stream)]
["this", "test"]
Required Arguments: stoplist – A list of lower-cased stop words.
Optional Arguments: minsize – An int indicating the smallest acceptable word length.
- class caterpillar.processing.analysis.filter.SubstitutionFilter(pattern, replacement)¶
Bases: caterpillar.processing.analysis.filter.Filter
Performs a regular expression substitution on the token text.
This is especially useful for removing text from tokens, for example hyphens:
ana = RegexpTokenizer(r"\S+") | SubstitutionFilter("-", "")
Because it has the full power of the regex.sub() method behind it, this filter can perform some fairly complex transformations. For example, to take tokens like 'a=b', 'c=d', 'e=f' and change them to 'b=a', 'd=c', 'f=e':
>>> sf = SubstitutionFilter("([^=]*)=(.*)", r"\2=\1")
Required Arguments:
pattern – A pattern string or compiled regular expression object describing the text to replace.
replacement – A string of substitution text.
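Applying the filter to a token stream then follows the documented filter() method, as in this sketch:
rext = RegexpTokenizer(r"\S+")
sub = SubstitutionFilter("([^=]*)=(.*)", r"\2=\1")
values = [token.value for token in sub.filter(rext("a=b c=d"))]
# values should be ["b=a", "d=c"]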
caterpillar.processing.analysis.stopwords module¶
- caterpillar.processing.analysis.stopwords.parse_stopwords(stopwords_file)¶
Parse stopwords from a plain text file.
Expects a single stopword on every line.
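A usage sketch, assuming the function accepts an open file object (the filename is illustrative):
from caterpillar.processing.analysis.stopwords import parse_stopwords

with open("my_stopwords.txt") as stopwords_file:  # one stopword per line
    stoplist = parse_stopwords(stopwords_file)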
caterpillar.processing.analysis.tokenize module¶
Tools to tokenize text.
- class caterpillar.processing.analysis.tokenize.EverythingTokenizer¶
Bases: caterpillar.processing.analysis.tokenize.Tokenizer
Returns a generator that yields the entire input string as a single token.
- class caterpillar.processing.analysis.tokenize.ParagraphTokenizer¶
Bases: caterpillar.processing.analysis.tokenize.RegexpTokenizer
Tokenize a string into paragraphs.
This is accomplished by treating any sentence-break character, plus any non-space text, followed by a newline character as the end of a paragraph.
Because of titles etc., we also treat any two or more consecutive newline characters as a paragraph break.
- class caterpillar.processing.analysis.tokenize.RegexpTokenizer(pattern, gaps=False, flags=312)¶
Bases: caterpillar.processing.analysis.tokenize.Tokenizer
A Tokenizer that splits a string using a regular expression.
This class can be used to match either the tokens or the separators between tokens.
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
This class uses the newer Python regex module instead of the standard re module, which means Unicode codepoint properties (e.g. \p{N}) are supported.
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\p{N}\.]+|\S+')
Required Arguments: pattern – A str used to build this tokenizer. This pattern may safely contain grouping parentheses.
Optional Arguments:
gaps – A bool indicating whether this tokenizer's pattern should be used to find the separators between tokens rather than the tokens themselves. Defaults to False.
flags – An int mask of regex flags used to compile this tokenizer's pattern. Defaults to regex.UNICODE | regex.MULTILINE | regex.DOTALL | regex.VERSION1.
- tokenize(value)¶
Perform the tokenizing.
Required Arguments: value – The unicode string to tokenize.
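For example, a sketch in the style of the doctests above:
>>> tokenizer = RegexpTokenizer(r"\w+")
>>> [token.value for token in tokenizer.tokenize("one two three")]
["one", "two", "three"]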
- class caterpillar.processing.analysis.tokenize.Token(value=None, position=None, stopped=None, index=None)¶
Bases: object
A class representing a “token” (usually a word) extracted from the source text being analysed.
Because object instantiation in Python is slow, all tokenizers use this class as a singleton: ONE SINGLE Token object is YIELDED OVER AND OVER by a tokenizer, with its attributes changed each time.
This trick means that consumers of tokens (i.e. filters) must never try to hold onto the token object between loop iterations, or convert the token generator into a list. Instead, save the attributes between iterations, not the object!
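A sketch of the wrong and right ways to consume a token stream (the input text is illustrative):
tokenizer = WordTokenizer()

# WRONG: every element ends up referencing the same, reused Token instance.
tokens = list(tokenizer("some sample text"))

# RIGHT: save the attributes you need, or take an explicit copy.
values = [token.value for token in tokenizer("some sample text")]
copies = [token.copy() for token in tokenizer("some sample text")]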
- copy()¶
Return a deep copy of this object.
- update(value, stopped=False, position=None, index=None)¶
Re-initialise this token instance with the passed values.
Required Arguments: value – The unicode value of this Token.
Optional Arguments:
stopped – A bool indicating whether this token was stopped by a filter.
position – An int indicating the token's original position in the stream (0-based).
index – A tuple of two ints indicating the start and end index of this token in the original stream.
Returns this token.
- class caterpillar.processing.analysis.tokenize.Tokenizer¶
Bases: object
Abstract base class for all Tokenizers.
Forces all implementers to implement a tokenize() method.
- class caterpillar.processing.analysis.tokenize.WordTokenizer(detect_compound_names=True)¶
Bases: caterpillar.processing.analysis.tokenize.RegexpTokenizer
Tokenize a string into words.
This Tokenizer contains a bunch of special logic that in an ideal world would be separated out into filters. Unfortunately this isn’t an ideal world and filters can be slow. The tokenizer will:
- return compound names as a single token;
- keep emails intact as a single token;
- keep contractions (except possessives) as a single token;
- return numbers, including decimal numbers, as a single token; and
- break the text into words on whitespace and punctuation (except where stated above).
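A usage sketch illustrating some of these cases (exact token boundaries depend on the rules above):
tokenizer = WordTokenizer()
text = "Email john.smith@example.com, it isn't worth 3.14 dollars"
for token in tokenizer(text):
    print(token.value)
# the email address, the contraction "isn't" and the decimal 3.14
# should each come back as a single token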