Text parser
About
A text parser written in the Python language.
The project has one goal: speed! See the benchmark below for more details.
Project homepage: https://github.com/eerimoq/textparser
Documentation: http://textparser.readthedocs.org/en/latest
Credits
Thanks to PyParsing for a user-friendly interface. Many of
textparser’s class names are taken from that project.
Installation
pip install textparser
Example usage
The Hello World example parses the string Hello, World! and
outputs its parse tree ['Hello', ',', 'World', '!'].
The script:
import textparser
from textparser import Sequence


class Parser(textparser.Parser):

    def token_specs(self):
        return [
            ('SKIP', r'[ \r\n\t]+'),
            ('WORD', r'\w+'),
            ('EMARK', '!', r'!'),
            ('COMMA', ',', r','),
            ('MISMATCH', r'.')
        ]

    def grammar(self):
        return Sequence('WORD', ',', 'WORD', '!')


tree = Parser().parse('Hello, World!')
print('Tree:', tree)
Script execution:
$ env PYTHONPATH=. python3 examples/hello_world.py
Tree: ['Hello', ',', 'World', '!']
Benchmark
A benchmark comparing the speed of 10 JSON parsers, parsing a 276 kb file.
$ env PYTHONPATH=. python3 examples/benchmarks/json/speed.py
Parsed 'examples/benchmarks/json/data.json' 1 time(s) in:
PACKAGE         SECONDS   RATIO  VERSION
textparser         0.10    100%  0.21.1
parsimonious       0.17    169%  unknown
lark (LALR)        0.27    267%  0.7.0
funcparserlib      0.34    340%  unknown
textx              0.54    546%  1.8.0
pyparsing          0.68    684%  2.4.0
pyleri             0.88    886%  1.2.2
parsy              0.92    925%  1.2.0
parsita            2.28   2286%  unknown
lark (Earley)      2.34   2348%  0.7.0
NOTE 1: The parsers are not necessarily optimized for speed. Optimizing them will likely affect the measurements.
NOTE 2: The structure of the resulting parse trees varies and additional processing may be required to make them fit the user application.
NOTE 3: Only JSON parsers are compared. Parsing other languages may give vastly different results.
Contributing
1. Fork the repository.
2. Implement the new feature or bug fix.
3. Implement test case(s) to ensure that future changes do not break legacy.
4. Run the tests.
   python3 -m unittest
5. Create a pull request.
The parser class
- class textparser.Parser
The abstract base class of all text parsers.
>>> from textparser import Parser, Sequence
>>> class MyParser(Parser):
...     def token_specs(self):
...         return [
...             ('SKIP', r'[ \r\n\t]+'),
...             ('WORD', r'\w+'),
...             ('EMARK', '!', r'!'),
...             ('COMMA', ',', r','),
...             ('MISMATCH', r'.')
...         ]
...     def grammar(self):
...         return Sequence('WORD', ',', 'WORD', '!')
- token_specs()
The token specifications with token name, regular expression, and optionally a user-friendly name.
Two token specification forms are available: (kind, re) or (kind, name, re). If the second form is used, the grammar should use name instead of kind.
See Parser for an example usage.
- tokenize(text)
Tokenize given string text, and return a list of tokens. Raises TokenizeError on failure.
This method should only be called by parse(), but may very well be overridden if the default implementation does not match the parser needs.
- grammar()
The text grammar is used to create a parse tree out of a list of tokens.
See Parser for an example usage.
- parse(text, token_tree=False, match_sof=False)
Parse given string text and return the parse tree. Raises ParseError on failure.
Returns a parse tree of tokens if token_tree is True.
>>> MyParser().parse('Hello, World!')
['Hello', ',', 'World', '!']
>>> tree = MyParser().parse('Hello, World!', token_tree=True)
>>> from pprint import pprint
>>> pprint(tree)
[Token(kind='WORD', value='Hello', offset=0),
 Token(kind=',', value=',', offset=5),
 Token(kind='WORD', value='World', offset=7),
 Token(kind='!', value='!', offset=12)]
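To make the relationship between token_specs() and tokenize() more concrete, here is a minimal sketch of a tokenizer driven by the same kind of specification list, using only the standard re module. It is not textparser's implementation: sketch_tokenize and the local Token tuple are hypothetical names, and treating SKIP and MISMATCH as special kinds is an assumption based on the examples above.

```python
import re
from collections import namedtuple

# Local stand-in for a token; mirrors the fields shown in the parse() example.
Token = namedtuple('Token', ['kind', 'value', 'offset'])

TOKEN_SPECS = [
    ('SKIP', r'[ \r\n\t]+'),
    ('WORD', r'\w+'),
    ('EMARK', '!', r'!'),
    ('COMMA', ',', r','),
    ('MISMATCH', r'.')
]


def sketch_tokenize(text, specs):
    """Tokenize text using (kind, re) and (kind, name, re) specifications."""
    names = {}
    parts = []

    for spec in specs:
        if len(spec) == 2:
            kind, regex = spec
            name = kind
        else:
            kind, name, regex = spec

        names[kind] = name
        parts.append('(?P<{}>{})'.format(kind, regex))

    # One combined regular expression; earlier specifications win.
    master = re.compile('|'.join(parts))
    tokens = []

    for mo in master.finditer(text):
        kind = mo.lastgroup

        if kind == 'SKIP':
            continue  # Assumption: SKIP matches are dropped.

        if kind == 'MISMATCH':
            raise ValueError(
                'Unexpected character at offset {}.'.format(mo.start()))

        # The user-friendly name, if any, replaces the kind.
        tokens.append(Token(names[kind], mo.group(kind), mo.start()))

    return tokens


tokens = sketch_tokenize('Hello, World!', TOKEN_SPECS)
print(tokens)
```

With the three-element form, the comma token is reported with kind ',' rather than 'COMMA', which is why the grammar refers to ',' and '!' directly.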
Building the grammar
The grammar is built by combining the classes below and strings.
Here is a fictitious example grammar:
grammar = Sequence(
    'BEGIN',
    Optional(choice('IF', Sequence(ZeroOrMore('NUMBER')))),
    OneOrMore(Sequence('WORD', Not('NUMBER'))),
    Any(),
    DelimitedList('WORD', delim=':'),
    'END')
- class textparser.Sequence(*patterns)
Matches a sequence of patterns. Becomes a list in the parse tree.
- class textparser.Choice(*patterns)
Matches any of the given ordered patterns. The first pattern in the list has the highest priority, and the last the lowest.
- class textparser.ChoiceDict(*patterns)
Matches any of the given patterns. The first token kind of all patterns must be unique, otherwise an Error exception is raised.
This class is faster than Choice, and should be used if the grammar allows it.
- textparser.choice(*patterns)
Returns an instance of the fastest choice class for the given patterns. It is recommended to use this function instead of instantiating Choice or ChoiceDict directly.
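The difference between an ordered choice and a dictionary-based choice can be illustrated with a small standalone sketch. It only models the idea of selecting an alternative by its first token kind; the function names and the ALTERNATIVES data are hypothetical, and textparser's own classes may work differently internally.

```python
def ordered_choice(alternatives, kind):
    """Check the alternatives one by one; the first match wins."""
    for first_kind, name in alternatives:
        if first_kind == kind:
            return name

    raise LookupError('No alternative starts with {!r}.'.format(kind))


def build_dispatch_table(alternatives):
    """Map each first token kind to its alternative, rejecting duplicates."""
    table = {}

    for first_kind, name in alternatives:
        if first_kind in table:
            raise ValueError(
                'First token kind {!r} is not unique.'.format(first_kind))

        table[first_kind] = name

    return table


# Hypothetical alternatives: (first token kind, alternative name).
ALTERNATIVES = [
    ('IF', 'if-statement'),
    ('WHILE', 'while-loop'),
    ('WORD', 'call')
]

table = build_dispatch_table(ALTERNATIVES)

# The ordered variant scans up to three entries for 'WORD'.
print(ordered_choice(ALTERNATIVES, 'WORD'))

# The dictionary variant needs a single lookup, whatever the list length.
print(table['WORD'])
```

The dictionary approach only works when the first token kind identifies the alternative unambiguously, which matches the uniqueness requirement stated for ChoiceDict.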
- class textparser.ZeroOrMore(pattern)
Matches pattern zero or more times.
See Repeated for more details.
- class textparser.ZeroOrMoreDict(pattern, key=None)
Matches pattern zero or more times.
See RepeatedDict for more details.
- class textparser.OneOrMore(pattern)
Matches pattern one or more times.
See Repeated for more details.
- class textparser.OneOrMoreDict(pattern, key=None)
Matches pattern one or more times.
See RepeatedDict for more details.
- class textparser.DelimitedList(pattern, delim=',')
Matches a delimited list of pattern separated by delim. pattern must be matched at least once. Any match becomes a list in the parse tree, excluding the delimiters.
- class textparser.Optional(pattern)
Matches pattern zero or one times. Becomes a list in the parse tree, empty on mismatch.
- class textparser.AnyUntil(pattern)
Matches any token until given pattern is found. Becomes a list in the parse tree, not including the given pattern match.
- class textparser.And(pattern)
Matches pattern, without consuming any tokens. Any match becomes an empty list in the parse tree.
- class textparser.Not(pattern)
Matches if pattern does not match. Any match becomes an empty list in the parse tree.
Just like And, no tokens are consumed.
- class textparser.Tag(name, pattern)
Tags any match of pattern with the given name. Becomes a two-tuple of name and match in the parse tree.
- class textparser.Forward
Forward declaration of a pattern.
>>> foo = Forward()
>>> foo <<= Sequence('NUMBER')
- class textparser.Repeated(pattern, minimum=0)
Matches pattern at least minimum times. Any match becomes a list in the parse tree.
- class textparser.RepeatedDict(pattern, minimum=0, key=None)
Same as Repeated, but becomes a dictionary instead of a list in the parse tree.
key is a function taking the match as input and returning the dictionary key. By default the first element in the match is used as key.
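The parse tree shapes described above (a list for a sequence, an empty list for an unmatched optional, a delimited list without its delimiters) can be explored with a small standalone sketch. The classes Seq, Opt and Delimited and the match helper are hypothetical stand-ins written for illustration only. They are not textparser classes and say nothing about how the library is implemented.

```python
class Mismatch(Exception):
    """Raised when a pattern does not match at the current position."""


def match(pattern, tokens, pos):
    """Match a pattern object, or a token kind string, at tokens[pos]."""
    if isinstance(pattern, str):
        if pos < len(tokens) and tokens[pos][0] == pattern:
            return tokens[pos][1], pos + 1

        raise Mismatch(pattern)

    return pattern.match(tokens, pos)


class Seq:
    """All patterns in order. The result is a list."""

    def __init__(self, *patterns):
        self.patterns = patterns

    def match(self, tokens, pos):
        result = []

        for pattern in self.patterns:
            value, pos = match(pattern, tokens, pos)
            result.append(value)

        return result, pos


class Opt:
    """Zero or one match. The result is a list, empty on mismatch."""

    def __init__(self, pattern):
        self.pattern = pattern

    def match(self, tokens, pos):
        try:
            value, pos = match(self.pattern, tokens, pos)
        except Mismatch:
            return [], pos

        return [value], pos


class Delimited:
    """One or more items separated by delim. Delimiters are dropped."""

    def __init__(self, pattern, delim=','):
        self.pattern = pattern
        self.delim = delim

    def match(self, tokens, pos):
        value, pos = match(self.pattern, tokens, pos)
        result = [value]

        while True:
            try:
                _, after_delim = match(self.delim, tokens, pos)
                value, after_item = match(self.pattern, tokens, after_delim)
            except Mismatch:
                return result, pos

            result.append(value)
            pos = after_item


# Tokens as (kind, value) pairs, as a tokenizer might produce them.
tokens = [('WORD', 'a'), (',', ','), ('WORD', 'b'), ('!', '!')]
grammar = Seq(Delimited('WORD'), Opt('?'), '!')
tree, end = match(grammar, tokens, 0)
print(tree)  # [['a', 'b'], [], '!']
```

The nested list comes from the delimited list, the empty list from the optional '?' that did not match, and the final string from the plain token kind.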
Exceptions
- class textparser.ParseError(text, offset)
This exception is raised when the parser fails to parse the text.
- property text
The input text to the parser.
- property offset
Offset into the text where the parser failed.
- property line
Line where the parser failed.
- property column
Column where the parser failed.
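As an illustration of how a line and column can be derived from the text and offset properties, here is a minimal standalone sketch. The helper name line_and_column is hypothetical, and 1-based numbering for both values is an assumption, since the text above does not state the base.

```python
def line_and_column(text, offset):
    """Return the (line, column) of offset in text, both 1-based."""
    # Every newline before the offset starts a new line.
    line = text.count('\n', 0, offset) + 1

    # rfind() returns -1 when there is no earlier newline, which makes
    # the first character of the text column 1.
    column = offset - text.rfind('\n', 0, offset)

    return line, column


print(line_and_column('0\n1234\n56', 3))  # (2, 2)
print(line_and_column('Hello World!', 6))  # (1, 7)
```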
Utility functions
- textparser.markup_line(text, offset, marker='>>!<<')
Insert marker at offset into text, and return the marked line.
>>> markup_line('0\n1234\n56', 3)
1>>!<<234
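The behaviour in the example above can be reproduced with a few string operations. The following is a sketch that is consistent with the documented example, under the hypothetical name markup_line_sketch. The library's actual code may differ.

```python
def markup_line_sketch(text, offset, marker='>>!<<'):
    """Return the line containing offset, with marker inserted at offset."""
    # Start of the line: just after the previous newline, or 0 if none.
    begin = text.rfind('\n', 0, offset) + 1

    # End of the line: the next newline, or the end of the text.
    end = text.find('\n', offset)

    if end == -1:
        end = len(text)

    return text[begin:offset] + marker + text[offset:end]


print(markup_line_sketch('0\n1234\n56', 3))  # 1>>!<<234
```

Combined with the offset of a parse error, such a helper can point at the exact position where parsing failed.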
- textparser.tokenize_init(spec)
Initialize a tokenizer. Should only be called by the tokenize() method in the parser.