We all love a little regex hacking now and then. I loved it enough to even write a regex matching library called libtre. The cool thing about this library is that it supports searching for approximate matches.
The approximate matching features of this library are being used for things like improving OCR results, generating “did you mean?” suggestions for users’ searches, and filtering spam. The library comes with a 2-clause BSD license so you can use it for pretty much whatever you want.
Installing the Python extension
The extension works with Python 2. Python 3 is not yet supported.
Windows
Install the prebuilt Python extension: tre-0.8.0.win32-py2.6.msi. The package works with Python 2.6 from python.org.
Linux, OS X, and other Unix
Although you can probably find prebuilt TRE binaries for your system, the Python extension is usually not included. First, install the library itself, either from prebuilt binaries or from source:
```
wget http://laurikari.net/tre/tre-0.8.0.tar.bz2
tar xjvf tre-0.8.0.tar.bz2
cd tre-0.8.0
./configure
sudo make install
```
Then, build and install the Python extension:
```
cd python
umask 022
sudo python setup.py install
```
Taking it for a test drive
The first thing you can try is running the example program included with the extension. Here’s the code for example.py:
```
import tre

pt = tre.compile("Don(ald)?( Ervin)? Knuth", tre.EXTENDED)

data = """
In addition to fundamental contributions in several branches of
theoretical computer science, Donnald Erwin Kuth is the creator of the
TeX computer typesetting system, the related METAFONT font definition
language and rendering system, and the Computer Modern family of
typefaces.
"""

fz = tre.Fuzzyness(maxerr = 3)
print fz
m = pt.search(data, fz)

if m:
    print m.groups()
    print m[0]
```
When you run example.py, the output should look like this:
```
tre.Fuzzyness(delcost=1,inscost=1,maxcost=2147483647,subcost=1,
maxdel=2147483647,maxerr=3,maxins=2147483647,maxsub=2147483647)
((95, 113), (99, 108), (102, 108))
Donnald Erwin Kuth
```
So what does this program do? First we compile the regex Don(ald)?( Ervin)? Knuth. This simple regex matches any of the following four strings:
- Donald Ervin Knuth
- Donald Knuth
- Don Ervin Knuth
- Don Knuth
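For contrast, here’s what an exact matcher makes of the misspelled text. This sketch uses Python’s standard re module rather than tre (and Python regex syntax, not POSIX), just to show that exact search comes up empty:

```python
import re

# The same pattern, searched exactly with the standard re module.
pattern = re.compile(r"Don(ald)?( Ervin)? Knuth")

text = "Donnald Erwin Kuth is the creator of the TeX typesetting system."

# Every form of the name in the text is misspelled, so exact search fails.
print(pattern.search(text))  # -> None
```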
The interesting part is obviously this:
```
fz = tre.Fuzzyness(maxerr = 3)
m = pt.search(data, fz)
```
This searches for an approximate match to the regex in data, with at most three errors. The search finds “Donnald Erwin Kuth”, which matches the regex within those three errors: an inserted n, a w substituted for a v, and a deleted n.
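tre does this with an efficient algorithm; purely to illustrate what “within three errors” means, here’s a brute-force pure-Python sketch (not how tre works internally, and matching a fixed string rather than a regex) that finds the substring with the lowest edit distance to the target:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution
        prev = cur
    return prev[-1]

def approx_find(needle, haystack, maxerr):
    # Try every substring whose length is within maxerr of the needle's,
    # keeping the one with the smallest edit distance.
    best = None
    n = len(needle)
    for start in range(len(haystack)):
        for length in range(max(0, n - maxerr), n + maxerr + 1):
            sub = haystack[start:start + length]
            d = edit_distance(needle, sub)
            if d <= maxerr and (best is None or d < best[0]):
                best = (d, sub)
    return best

print(approx_find("Donald Ervin Knuth",
                  "... Donnald Erwin Kuth is the creator ...", 3))
# -> (3, 'Donnald Erwin Kuth')
```

This is O(n^3) and only good for a demo; the point is that the misspelled name is the substring closest to the target, at a distance of exactly three edits.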
Some more details
The tre module contents are similar to those of the re module which comes with Python. Here are the key parts:
tre.compile(pattern[, flags])
Compiles a regex pattern into a regex object. The regex object can then be used to search for approximate matches using its search() method.
A warning about regex syntax: this library uses the POSIX regex syntax, which is not 1:1 compatible with the usual Python regex syntax.
tre.Fuzzyness([delcost=1, inscost=1, subcost=1, maxcost=<unlimited>, maxdel=<unlimited>, maxins=<unlimited>, maxsub=<unlimited>, maxerr=<unlimited>])
Creates a “fuzzyness” object, used with the search() method to restrict the number of errors allowed.
RegexObject.search(string, fuzzyness[, flags])
Searches for a match in the string with the given fuzzyness.
So, as usual with regex libraries, you first compile a regex and then use it to find matches. The fuzzyness object defines how many errors of which kinds the match is allowed to have. The different kinds of errors are:
- insertion: an extra character, like changing Donald to Donnald
- deletion: a missing character, like changing Knuth to Kuth
- substitution: a changed character, like changing Ervin to Erwin
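The three error kinds can be seen with the standard library’s difflib (just an illustration; tre does not use difflib): SequenceMatcher reports them as insert, delete, and replace opcodes:

```python
import difflib

def error_ops(good, bad):
    # Return the names of the non-matching opcodes that turn good into bad.
    sm = difflib.SequenceMatcher(a=good, b=bad)
    return [op[0] for op in sm.get_opcodes() if op[0] != "equal"]

print(error_ops("Donald", "Donnald"))  # -> ['insert']
print(error_ops("Knuth", "Kuth"))      # -> ['delete']
print(error_ops("Ervin", "Erwin"))     # -> ['replace']
```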
The total number of allowed errors is controlled by the maxerr field. This is probably enough for most applications.
For more fine-grained control, the maximum number of errors can be set for each error type separately with maxins, maxdel, and maxsub. The allowed errors can also be limited using costs. The default cost of each error is one, but it can be set to any value using inscost, delcost, and subcost. The maximum total cost of a match is set with maxcost.
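To see how the costs interact, here’s a small weighted edit distance in pure Python (a sketch of the general idea only; tre’s actual algorithm is different). With unit costs it is plain Levenshtein distance; raising the substitution cost makes the cheapest route a deletion plus an insertion instead:

```python
def weighted_distance(a, b, inscost=1, delcost=1, subcost=1):
    # Dynamic-programming edit distance with per-operation costs,
    # mimicking the roles of inscost/delcost/subcost in tre.Fuzzyness.
    prev = [j * inscost for j in range(len(b) + 1)]   # "" -> b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i * delcost]                            # a[:i] -> ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + delcost,          # delete ca
                           cur[j - 1] + inscost,       # insert cb
                           prev[j - 1] + (subcost if ca != cb else 0)))
        prev = cur
    return prev[-1]

# With unit costs this is plain Levenshtein distance:
print(weighted_distance("Knuth", "Kuth"))              # -> 1
# An expensive substitution is replaced by a delete plus an insert:
print(weighted_distance("Ervin", "Erwin", subcost=5))  # -> 2
```

A match whose total cost exceeds maxcost is then rejected, just as it is rejected when any per-type count exceeds maxins, maxdel, or maxsub.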
There’s a lot more I could say, but I’ll save my breath for further posts. I’ll just close with a modification on the by now old regex matching adage first uttered in 1997:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions with approximate matching.” Now they have, like, maybe three or so problems.
Comments
I’ll be sure to try it out next time I’ll be doing approximate text parsing. I’ve done it a few times for “half natural half formal” languages, and I can see how this tool could help.
I wonder if and how it should be integrated with nltk ( http://www.nltk.org )
Nice library, thank you! :)
I’ll be using this for bioinformatics.
How fast is it, compared to the regular regexps?
Very cool….
Could you give some more examples with words?
Why isn’t this working?
qm = tre.compile("(Quantum (Mechanics|Physics))", tre.EXTENDED)
@Ivan, it does work for me; the regex compiles fine and matches either “Quantum Mechanics” or “Quantum Physics”. What’s the problem, exactly?
@Joseph, it’s a bit slower (by constant factors), but can still be quite fast. I have no benchmark results to show.
Ok cool.
I forgot that search() and findall() aren’t the same thing ;)
Is there a way to add docs to a python C module?
I am lost without my ipython tools:
```
In [47]: qm.search?
Docstring: try to match against given string,
returning tre.match object or None on failure

In [48]: %pdef qm.search
No definition header found for qm.search
```
I am thinking of using re objects for “smart tags” that unite multiple versions of the tag into a compact expression that a RE-literate person can interpret.
Thx for your help.
The doc strings are defined in the C file; AFAIK it’s not possible to edit them without recompiling. Perhaps a small .py wrapper should be written around the raw, almost 1:1 mapping for a more Pythonic interface. That file would also be a good home for doc strings.
Is there a way to find all (nonoverlapping) matches?
Nice library, thanks!
After following your instructions to install, I got this.
import tre
ImportError: libtre.so.5: cannot open shared object file: No such file or directory
Note: I use a fedora system and sudo does not work as expected, so I connected as superuser to install.