NPCRE Project

link to project page

Goal

The NPCRE project wishes to provide a sollution for projects who wishes to use the power of regular expressions, but cannot afford to pay the costs (either in time, complexity, or security) of using implementation like Perl Compatible Regular Expression library, which is using recursion, and some expressions provided to it might take exponential time to run
The project supplies a way to create a file for regular expression r, and then to use an extremely lightweight (about ten-lines of effective C code) to load the file and use it to recognize r.

Means

Design

The program is divided into two parts, the DFA creator, which creates a file that represents a certain regular expression and a DFA runner, that uses such a file to recognize a regular expression in a certain stream of bytes.

Theory

The underlying algorithms were the ones described in the excellent ``red dragon'' compilers book, by Aho Sethi and Ullman. In short we are creating DFAs out of regular expressions.

Off-the-Box usage

Please note that the user is not restricted to strings or characters. Both the DFA creator and the DFA runner can handle any sort of data. So you could use it for instance to filter some TCP requests.

Current State

Generally speaking the program is in pre-alpha stage. That is, it has currently more-or-less feature complete implementation, which is not tested enough and have many deficiency (either erroneous design which should be changed or unoptimized code) that need to be fixed.
It is underdocumented and has no obvious explanation how to operate the software.

Automata File Format

The automata file format is specified within the source files, but it is not yet permanent. There are a few matters

DFA runner

By design the program is small and has no (almost, currently it depends of a library to overcome little/big-endianness issues). It is in the npcre module in CVS. It's written in C and contains some documentation and examples.

DFA creator

Language

This is the core of the project. It is written in python (this might be change in the future due to portability and speed considerations). The python version is availible in the pynpcre CVS module.

Algorithms

It is currently able both to create DFA directly from a regular expression or to create an NFA (alg. due to Thompson) and to use the ``subset construction'' algorithm to create a DFA.
The second algorithm is obviously much less effecient and was implemented for testing purposes. The direct re2DFA algorithm will be used, and it resides in the re2dfa.py file. In the file there's an example of usage.
We also have in mindfa.py an algorithm which minimizes the DFA to the smallest possible DFA recognizing the same language. This ensures the smallest DFA exists theoretically.

Benchmarking

Here we'll test the DFA creator speed and memory usage for some interesting and big examples of regular expressions

Consecutive Repititions

Expression

If we'll repeat (a{p1})*|(a{p2})*|...|(a{pk})* (where {p} means 'a' p times) for many prime numberes the resulting DFA will have number of states as the greatest common divisor of all pi. If all numbers are prime the number of states would be the multiplication of all pi
We used the regular expression a{2}*|a{3}*|a{7}*|a{11}*|a{17}*|a{19}* that is:

(aaa)*|(aa)*|(aaaaaaaaaaa)*|(aaaaaaaaaaaaaaaaaaa)*|(aaaaaaa)*|(aaaaaaaaaaaaaaaaa)*

Results

The process took 1:17 hours of CPU time on 1Gh computer. It grossed 80 Megabytes of memory. Done on win32 using python 2.4 and process explorer to monitor resource usage.

contact

The first developer is Shachar Shemesh, for contact details check out his homepage. Unfortunately he has no time for the project now so please do not bother him about it.
The one who made the pynpcre branch and which is actively maintaining the project is Elazar Leibovich and unfortunately he does not have a homepage. You can drop him a word using email. His username at gmail is ``elazarl'' and I assume that if you're not vicious spam robots you'll figure out the actuall email address yourselves.
The author of the documentation is also Elazar Leibovich and he's wondering why do he have to refer himself in third-person.