SimpleParse is a BSD-licensed Python package providing
a simple parser generator for use with the mxTextTools
text-tagging engine. SimpleParse allows you to generate tagging
tables for use with the text-tagging engine directly from your
EBNF grammar.
Unlike most parser generators, SimpleParse generates single-pass parsers
(there is no distinct tokenization stage), an approach taken from
the predecessor project (mcf.pars) which attempted to create "autonomously
parsing regex objects". The resulting parsers are not as generalized
as those created by, for instance, the Earley algorithm, but they
do tend to be useful for the parsing of computer file formats and the
like (as distinct from natural language and similar "hard" parsing problems).
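As a rough illustration of what "single-pass" means here, the sketch below matches a tiny hypothetical grammar (pair := name, '=', number) by reading characters directly, with no separate token list. It is hand-written Python, not SimpleParse or mxTextTools code; the function names and the (tag, start, stop, children) result shape are assumptions chosen for the example.

```python
# Illustrative sketch only: not SimpleParse code.
# Hypothetical grammar:
#   pair   := name, '=', number
#   name   := [a-z]+
#   number := [0-9]+

def match_chars(text, pos, allowed):
    """Consume one or more characters from `allowed`; return the new position or None."""
    end = pos
    while end < len(text) and text[end] in allowed:
        end += 1
    return end if end > pos else None

def match_pair(text, pos=0):
    """Match name '=' number starting at pos, working on raw characters."""
    name_end = match_chars(text, pos, "abcdefghijklmnopqrstuvwxyz")
    if name_end is None:
        return None
    if name_end >= len(text) or text[name_end] != "=":
        return None
    num_start = name_end + 1
    num_end = match_chars(text, num_start, "0123456789")
    if num_end is None:
        return None
    # Assumed result shape: (tag, start, stop, children)
    children = [("name", pos, name_end, []), ("number", num_start, num_end, [])]
    return ("pair", pos, num_end, children)

result = match_pair("width=42")
print(result)
```

Each production reports the slice of the input it matched, so the result tree can be built in the same pass that reads the text.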
In addition to the parser generator, the SimpleParse project
includes a sub-project to create a modified version of the
mxTextTools engine which reorganizes the code to allow for certain
common EBNF constructs.
I actively welcome and support both new developers and new users. If you are interested in working on the project, feel free to contact me.
You will need a copy of Python with distutils support (Python versions 2.0 and above include this). If you want to build the non-recursive TextTools engine, you'll also need a C compiler compatible with your Python build and understood by distutils.
To install the base SimpleParse engine, download the latest version in your preferred format. If you are using the Win32 installer, simply run the executable. If you are using one of the source distributions, unpack the distribution into a temporary directory (maintaining the directory structure) then run:
setup.py install
in the top directory created by the expansion process.
You will want the mxBase 2.1.0 distribution to run SimpleParse. This package should be available in all the standard formats; follow the same instructions as for the SimpleParse package to install it. If you want to use the non-recursive implementation, you will need to get the source archive. It is possible to use mxBase 2.0.3 with SimpleParse, but not to build the non-recursive TextTools engine from it. Note: without the non-recursive rewrite of 2.1.0, the test suite will not pass all tests. A single test (which is run against a number of different versions of the simpleparse grammar) will fail. I'm not sure why it fails with the recursive version, but it does argue for using the non-recursive rewrite.
To build the non-recursive TextTools engine, you'll need to get
the source distribution for the non-recursive implementation from the
SimpleParse file repository. This archive is intended to be expanded
over the mxBase source archive from the top-level directory (it was
created with the 2.1.0 beta1 distribution specifically), replacing
one file and adding four others.
cd egenix-mx-base-2.1.0
gunzip non-recursive-1.0.0b1.tar.gz
tar -xvf non-recursive-1.0.0b1.tar
(Or use WinZip on Windows.) When you have completed that, run:
setup.py build --force install
in the top directory of the eGenix-mx-base source tree. It is hoped that eventually the non-recursive rewrite will be folded into the eGenix-mx-base distribution so this extra step won't be necessary.
New in 2.0:
General
Our (current) parsers are top-down, in that they work from the top of
the parsing graph (the root production). They are not, however, tokenising
parsers, so there is no appropriate LL(x) designation as far as I can see
(and there is an arbitrary lookahead mechanism that could theoretically parse
the entire rest of the file just to see if a particular character matches).
I would hazard a guess that they are theoretically closest to a deterministic
recursive-descent parser.
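The lookahead behaviour described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the mxTextTools mechanism: match_digits and lookahead are hypothetical helpers, and the point is that the check may scan arbitrarily far ahead while consuming nothing.

```python
# Illustrative sketch only: hypothetical helpers, not the SimpleParse API.

def match_digits(text, pos):
    """Match one or more digits at pos; return the end position or None."""
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    return end if end > pos else None

def lookahead(matcher, text, pos):
    """Succeed (returning pos unchanged) if matcher would match at pos."""
    return pos if matcher(text, pos) is not None else None

# The matcher scans four characters, but the position does not advance.
print(lookahead(match_digits, "2024-01", 0))

# With a long run of digits the check scans the whole input and still
# consumes nothing, which is why the cost of a lookahead is unbounded.
print(lookahead(match_digits, "9" * 1000, 0))
```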
There are no backtracking facilities, so any ambiguity is handled by choosing
the first successful match of a grammar (not the longest, as in most top-down
parsers, mostly because without tokenisation, it would be expensive to do
checks for each possible match's length). As a result of this, the parsers
are entirely deterministic.
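To make the first-match rule concrete, here is a minimal sketch that contrasts it with longest-match selection over a list of literal alternatives. The helper names are hypothetical and the code is not taken from SimpleParse; it only shows why the order of alternatives matters when the first success wins.

```python
# Illustrative sketch only: hypothetical helpers, not the SimpleParse API.

def first_match(text, pos, alternatives):
    """Return the end position of the first alternative that matches at pos."""
    for literal in alternatives:
        if text.startswith(literal, pos):
            return pos + len(literal)
    return None

def longest_match(text, pos, alternatives):
    """For contrast: try every alternative and keep the longest match."""
    ends = [pos + len(lit) for lit in alternatives if text.startswith(lit, pos)]
    return max(ends) if ends else None

text = "integer"
print(first_match(text, 0, ["int", "integer"]))    # stops after "int"
print(longest_match(text, 0, ["int", "integer"]))  # takes all of "integer"
print(first_match(text, 0, ["integer", "int"]))    # reordered: takes "integer"
```

With first-match selection, listing the longer alternative first gives the same result as longest-match without the cost of trying every alternative.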
The time/memory characteristics are such that, in general, the time to
parse an input text varies with the amount of text to parse. There are two
major factors: the time to do the actual parsing (which, for simple deterministic
grammars, should be close to linear with the length of the text, though a pathological
grammar might have radically different operating characteristics) and the
time to build the results tree (which depends on the memory architecture of
the machine, the currently free memory, and the phase of the moon). As a
rule, SimpleParse parsers will be faster (for suitably limited grammars) than
anything you can code directly in Python. They will not generally outperform
grammar-specific parsers written in C.
mxTextTools Rewrite Enhancements
Alternate C Back-end?
© 1998-2002, Copyright by Mike C. Fletcher; All Rights Reserved.
mailto: mcfletch@users.sourceforge.net
Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee or royalty is hereby granted, provided that the above copyright notice appear in all copies and that both the copyright notice and this permission notice appear in supporting documentation or portions thereof, including modifications, that you make.
THE AUTHOR MIKE C. FLETCHER DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS
SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS, IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL,
INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING
FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
WITH THE USE OR PERFORMANCE OF THIS SOFTWARE!
An Open Source project