We all love a little regex hacking now and then. I loved it enough to even write a regex matching library called libtre. The cool thing about this library is that it supports searching for approximate matches.
The approximate matching features of this library are being used for things like improving OCR results, generating “did you mean?” suggestions for users’ searches, and filtering spam. The library comes with a 2-clause BSD license so you can use it for pretty much whatever you want.
Installing the Python extension
The extension works with Python 2. Python 3 is not yet supported.
Linux, OS X, and other Unix
Although you can probably find prebuilt binaries for your system, the Python extension is usually not included. First, either install some of those binaries or compile from source:
wget http://laurikari.net/tre/tre-0.8.0.tar.bz2 tar xjvf tre-0.8.0.tar.bz2 cd tre-0.8.0 ./configure sudo make install
Then, build and install the Python extension:
cd python umask 022 sudo python setup.py install
Taking it for a test drive
The first thing you can try is run the example program included with the extension. Here’s the code for
import tre pt = tre.compile("Don(ald)?( Ervin)? Knuth", tre.EXTENDED) data = """ In addition to fundamental contributions in several branches of theoretical computer science, Donnald Erwin Kuth is the creator of the TeX computer typesetting system, the related METAFONT font definition language and rendering system, and the Computer Modern family of typefaces. """ fz = tre.Fuzzyness(maxerr = 3) print fz m = pt.search(data, fz) if m: print m.groups() print m
When you run
example.py the output should look like this:
tre.Fuzzyness(delcost=1,inscost=1,maxcost=2147483647,subcost=1, maxdel=2147483647,maxerr=3,maxins=2147483647,maxsub=2147483647) ((95, 113), (99, 108), (102, 108)) Donnald Erwin Kuth
So what does this program do? First we compile the regex
Don(ald( Ervin)?)? Knuth. This simple regex matches any of the following three strings:
- Donald Ervin Knuth
- Donald Knuth
- Don Knuth
The interesting part is obviously this:
fz = tre.Fuzzyness(maxerr = 3) m = pt.search(data, fz)
This searches for an approximate match to the regex in
data, with at most three errors. The search finds “Donnald Erwin Kuth” which matches our regex within those three errors.
Some more details
tre module contents are similar to the contents of the
re module which comes with Python. Here are the key parts:
Compiles a regex pattern into a regex object. The regex object can then be used to search for approximate matches using its
A warning about regex syntax: this library uses the POSIX regex syntax, which is not 1:1 compatible with the usual Python regex syntax.
tre.Fuzzyness([delcost=1, inscost=1, subcost=1, maxcost=<unlimited>, maxdel=<unlimited>, maxins=<unlimited>, maxsub=<unlimited>, maxerr=<unlimited>])
Creates a “fuzzyness” object, used with the
search() method to restrict the number of errors allowed.
RegexObject.search(string, fuzzyness[, flags])
Searches for a match in the string with the given fuzzyness.
So, as usual with regex libraries, you first compile a regex and then use it to find matches. The fuzzyness object defines how many errors of which kinds the match is allowed to have. The different kinds of errors are:
- insertion: an extra character, like changing Donald to Donnald
- deletion: a missing character, like changing Knuth to Kuth
- substitution: a changed character, like changing Ervin to Erwin
The total number of allowed errors is controlled by the
maxerr field. This is probably enough for most applications.
For more fine-grained control, the maximum number of errors can be controlled for each error type separately with
maxsub. The allowed errors can also be limited using costs. The default cost of each error is one, but can be set to any value using
subcost. The maximum cost is set with
There’s a lot more I could say, but I’ll save my breath for further posts. I’ll just close with a modification on the by now old regex matching adage first uttered in 1997:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions with approximate matching.” Now they have, like, maybe three or so problems.
- Is Your Regex Matcher Up to Snuff?
- How to Distribute Commercial Python Applications
- Overriding System Functions for Fun and Profit