NPCRE Project
link to project page
Goal
The NPCRE project wishes to provide a sollution for projects who wishes to use the power of regular
expressions, but cannot afford to pay the costs (either in time, complexity, or security) of using
implementation like Perl Compatible Regular Expression library, which is using recursion, and some
expressions provided to it might take exponential time to run
The project supplies a way to create a file for regular expression r, and then to use an
extremely lightweight (about ten-lines of effective C code) to load the file and use it to
recognize r.
Means
Design
The program is divided into two parts, the DFA creator, which creates a file that represents a
certain regular expression and a DFA runner, that uses such a file to recognize a regular
expression in a certain stream of bytes.
Theory
The underlying algorithms were the ones described in the excellent ``red dragon'' compilers book,
by Aho Sethi and Ullman. In short we are creating
DFAs out of regular expressions.
Off-the-Box usage
Please note that the user is not restricted to strings or characters. Both the DFA creator and the
DFA runner can handle any sort of data. So you could use it for instance to filter some TCP
requests.
Current State
Generally speaking the program is in pre-alpha stage. That is, it has currently
more-or-less feature complete implementation, which is not tested enough
and have many deficiency (either erroneous design which should be changed
or unoptimized code) that need to be fixed.
It is underdocumented and has no obvious explanation how to operate the
software.
Automata File Format
The automata file format is specified within the source files, but it is not
yet permanent. There are a few matters
DFA runner
By design the program is small and has no (almost, currently it depends of a
library to overcome little/big-endianness issues). It is in the npcre module
in CVS. It's written in C and contains some documentation and examples.
DFA creator
Language
This is the core of the project. It is written in python (this might be
change in the future due to portability and speed considerations).
The python version is availible in the pynpcre CVS module.
Algorithms
It is currently able both to create DFA directly from a regular expression
or to create an NFA (alg. due to Thompson) and to use the ``subset
construction'' algorithm to create a DFA.
The second algorithm is obviously much less effecient and was implemented for
testing purposes. The direct re2DFA algorithm will be used, and it resides
in the re2dfa.py file. In the file there's an example of usage.
We also have in mindfa.py an algorithm which minimizes the DFA to the smallest
possible DFA recognizing the same language. This ensures the smallest DFA
exists theoretically.
Benchmarking
Here we'll test the DFA creator speed and memory usage for some interesting
and big examples of regular expressions
Consecutive Repititions
Expression
If we'll repeat (a{p1})*|(a{p2})*|...|(a{pk})*
(where {p} means 'a' p times)
for many prime numberes the resulting DFA will have number of states as
the greatest common divisor of all pi. If all numbers are
prime the number of states would be the multiplication of all
pi
We used the regular expression a{2}*|a{3}*|a{7}*|a{11}*|a{17}*|a{19}*
that is:
(aaa)*|(aa)*|(aaaaaaaaaaa)*|(aaaaaaaaaaaaaaaaaaa)*|(aaaaaaa)*|(aaaaaaaaaaaaaaaaa)*
Results
The process took 1:17 hours of CPU time on 1Gh computer.
It grossed 80 Megabytes of memory.
Done on win32 using python 2.4 and process explorer to monitor resource usage.
contact
The first developer is Shachar Shemesh, for contact details check
out his homepage.
Unfortunately he has no time for the project now so please
do not bother him about it.
The one who made the pynpcre branch and which is actively maintaining
the project is Elazar Leibovich and unfortunately he does not have
a homepage. You can drop him a word using email. His username
at gmail is ``elazarl'' and I assume that if you're not vicious spam
robots you'll figure out the actuall email address yourselves.
The author of the documentation is also Elazar Leibovich and he's
wondering why do he have to refer himself in third-person.