Parsing in Python
Did you ever feel the need to learn lex and yacc? I did.
I recently found a Python module for parsing grammars: pyparsing. In contrast to traditional parser-generating approaches, this framework doesn’t require you to learn a specific toolchain. It also doesn’t generate any code. It’s a class library: you construct your grammar by connecting objects.
When building very basic grammars, the code looks very similar to the BNF. Thanks to Python’s operator overloading, it’s possible to compose parse nodes (non-terminals) using operators like + (concatenation), ^ (or, which picks the longest matching alternative) and | (match-first, which picks the first alternative that matches). Here’s what it looks like:
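    from pyparsing import *

    # Optionally signed integer, converted to a Python int by the parse action
    IntLiteral = Regex('[\\+\\-]?\\d+').setParseAction(lambda s,l,t: int(t[0]))
    VariableName = Regex('\\w+')
    # The equal sign (with optional surrounding whitespace) is dropped from the result
    EqualSign = Regex('\\s*=\\s*').suppress()
    WS = White().suppress()
    # Group the name and the value into one nested list
    KeyValue = Group(VariableName + EqualSign + IntLiteral)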
Strings can now be parsed by calling parseString() on the grammar:
    self.assertEqual([['foo', 234]], KeyValue.parseString('foo=234').asList())
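The difference between | and ^ is easy to miss, so here is a minimal sketch of it. The two alternatives "foo" and "foobar" are made up for illustration; the point is only that | stops at the first alternative that matches, while ^ tries all of them and keeps the longest match.

```python
from pyparsing import Literal

# MatchFirst (|): alternatives are tried left to right, the first match wins.
first = Literal("foo") | Literal("foobar")

# Or (^): all alternatives are tried, the longest match wins.
longest = Literal("foo") ^ Literal("foobar")

# parseString() does not require the whole input to be consumed by default,
# so the shorter match is accepted without an error.
first_result = first.parseString("foobar").asList()
longest_result = longest.parseString("foobar").asList()

print(first_result)
print(longest_result)
```

With |, the order of the alternatives matters: putting the longer literal first would make both expressions behave the same on this input.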
For my requirements, this is a very usable approach to parsing. It may not be as fast as a generated parser in C, but it’s easy to learn and takes way less time to write.
“arbitrary grammars”? Certainly not! More like “arbitrary context-free grammars” or “arbitrary unambiguous context-free grammars”, eh? Still, cool software.
Oh thanks, you’re right with that. Corrected.
Guenther -
Welcome to pyparsing! I’m glad you like the intuitive way to combine elements into more complex expressions.
Not sure whether this goes with or against your RE experience, but try to define your grammar just in terms of the non-whitespace characters. For example, you can parse "foo=42" or "foo = 42" with the same grammar: Word(alphas) + '=' + Word(nums). By default, pyparsing skips over whitespace. No need for that distracting '\s*' clutter!
If you have questions, post them on the Wiki home page Discussion tab, or on the pyparsing mailing list.
– Paul
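Paul’s suggestion can be sketched roughly like this. The name key_value, the Group wrapper, the Suppress around the equal sign and the int conversion are my own additions to mirror the example from the post; his comment only gives the Word(alphas) + '=' + Word(nums) part. Note that Word(nums) matches unsigned digits only, so this version does not handle a leading sign.

```python
from pyparsing import Group, Suppress, Word, alphas, nums

# Integer value: a run of digits, converted to int by the parse action.
int_value = Word(nums).setParseAction(lambda t: int(t[0]))

# No explicit whitespace handling: pyparsing skips whitespace between
# the elements of an expression by default.
key_value = Group(Word(alphas) + Suppress("=") + int_value)

compact = key_value.parseString("foo=42").asList()
spaced = key_value.parseString("foo = 42").asList()

print(compact)
print(spaced)
```

Both inputs go through the same grammar, which is the whole point: the whitespace never shows up in the grammar definition.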