Approximate Regex Matching in Python

by Ville Laurikari on Saturday, September 19, 2009


We all love a little regex hacking now and then. I loved it enough to even write a regex matching library called libtre. The cool thing about this library is that it supports searching for approximate matches.

The approximate matching features of this library are being used for things like improving OCR results, generating “did you mean?” suggestions for users’ searches, and filtering spam. The library comes with a 2-clause BSD license so you can use it for pretty much whatever you want.

Installing the Python extension

The extension works with Python 2. Python 3 is not yet supported.

Windows

Install the prebuilt Python extension: tre-0.8.0.win32-py2.6.msi. The package works with Python 2.6 from python.org.

Linux, OS X, and other Unix

Although you can probably find prebuilt binaries of the library for your system, the Python extension is usually not included in them. First, either install one of those prebuilt packages or compile the library from source:

wget http://laurikari.net/tre/tre-0.8.0.tar.bz2
tar xjvf tre-0.8.0.tar.bz2
cd tre-0.8.0
./configure
make
sudo make install

Then, build and install the Python extension:

cd python
umask 022
sudo python setup.py install

Taking it for a test drive

The first thing you can try is running the example program included with the extension. Here’s the code for example.py:

import tre
 
pt = tre.compile("Don(ald( Ervin)?)? Knuth", tre.EXTENDED)
data = """
In addition to fundamental contributions in several branches of
theoretical computer science, Donnald Erwin Kuth is the creator of
the TeX computer typesetting system, the related METAFONT font
definition language and rendering system, and the Computer Modern
family of typefaces.
"""
 
fz = tre.Fuzzyness(maxerr = 3)
print fz
m = pt.search(data, fz)
 
if m:
    print m.groups()
    print m[0]

When you run example.py the output should look like this:

tre.Fuzzyness(delcost=1,inscost=1,maxcost=2147483647,subcost=1,maxdel=2147483647,maxerr=3,maxins=2147483647,maxsub=2147483647)
((95, 113), (99, 108), (102, 108))
Donnald Erwin Kuth

So what does this program do? First we compile the regex Don(ald( Ervin)?)? Knuth. This simple regex matches any of the following three strings:

  • Donald Ervin Knuth
  • Donald Knuth
  • Don Knuth

The interesting part is obviously this:

fz = tre.Fuzzyness(maxerr = 3)
m = pt.search(data, fz)

This searches for an approximate match to the regex in data, with at most three errors. The search finds “Donnald Erwin Kuth” which matches our regex within those three errors.

Some more details

The tre module contents are similar to the contents of the re module which comes with Python. Here are the key parts:

tre.compile(pattern[, flags])
Compiles a regex pattern into a regex object. The regex object can then be used to search for approximate matches using its search() method.

A warning about regex syntax: this library uses the POSIX regex syntax, which is not 1:1 compatible with the usual Python regex syntax.

tre.Fuzzyness([delcost=1, inscost=1, subcost=1, maxcost=<unlimited>, maxdel=<unlimited>, maxins=<unlimited>, maxsub=<unlimited>, maxerr=<unlimited>])
Creates a “fuzzyness” object, used with the search() method to restrict the number of errors allowed.

RegexObject.search(string, fuzzyness[, flags])
Searches for a match in the string with the given fuzzyness.

So, as usual with regex libraries, you first compile a regex and then use it to find matches. The fuzzyness object defines how many errors of which kinds the match is allowed to have. The different kinds of errors are:

  • insertion: an extra character, like changing Donald to Donnald
  • deletion: a missing character, like changing Knuth to Kuth
  • substitution: a changed character, like changing Ervin to Erwin
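
To make the error counting concrete, here is a plain-Python sketch (independent of the tre library) of the edit distance underlying these three error kinds. Applied to the example above, it confirms that “Donnald Erwin Kuth” is exactly three edits away from “Donald Ervin Knuth”:

```python
def edit_distance(a, b):
    # Classic dynamic programming: dist[i][j] is the minimum number of
    # insertions, deletions, and substitutions needed to turn a[:i]
    # into b[:j].
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i          # delete all of a[:i]
    for j in range(len(b) + 1):
        dist[0][j] = j          # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[len(a)][len(b)]

print(edit_distance("Donald Ervin Knuth", "Donnald Erwin Kuth"))  # 3
```

The three edits are one insertion (the extra n in Donnald), one substitution (v to w in Erwin), and one deletion (the missing n in Kuth), which is why maxerr = 3 is just enough for the search to succeed.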

The total number of allowed errors is controlled by the maxerr field. This is probably enough for most applications.

For more fine-grained control, the maximum number of errors can be controlled for each error type separately with maxins, maxdel, and maxsub. The allowed errors can also be limited using costs. The default cost of each error is one, but can be set to any value using inscost, delcost, and subcost. The maximum cost is set with maxcost.
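
For a sense of how those cost knobs interact, here is a weighted variant of the same dynamic program. This is a plain-Python sketch, not the library’s implementation; the parameter names merely mirror tre.Fuzzyness:

```python
def weighted_edit_distance(a, b, inscost=1, delcost=1, subcost=1):
    # Weighted edit distance: each error type contributes its own
    # cost, analogous to inscost, delcost, and subcost above.
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dist[i][0] = i * delcost
    for j in range(1, len(b) + 1):
        dist[0][j] = j * inscost
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else subcost
            dist[i][j] = min(dist[i - 1][j] + delcost,
                             dist[i][j - 1] + inscost,
                             dist[i - 1][j - 1] + sub)
    return dist[len(a)][len(b)]

# With the default costs, "Ervin" vs "Erwin" costs 1 (one substitution).
# With subcost=2, the same match costs 2, so a search capped at
# maxcost=1 would reject it.
print(weighted_edit_distance("Ervin", "Erwin", subcost=2))  # 2
```

Raising the cost of one error type is a handy way to tell the matcher which kinds of corruption you expect in your data, for example cheap substitutions for OCR output.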

There’s a lot more I could say, but I’ll save it for future posts. I’ll just close with a twist on the by-now-old regex matching adage, first uttered in 1997:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions with approximate matching.” Now they have, like, maybe three or so problems.

Related posts:

  1. Is Your Regex Matcher Up to Snuff?
  2. How to Distribute Commercial Python Applications
  3. Overriding System Functions for Fun and Profit


Comments

lorg September 23, 2009 at 01:08

I’ll be sure to try it out next time I’ll be doing approximate text parsing. I’ve done it a few times for “half natural half formal” languages, and I can see how this tool could help.

I wonder if and how it should be integrated with nltk ( http://www.nltk.org )

Alex Dedul September 23, 2009 at 03:54

Nice library, thank you! :)


Austin January 20, 2012 at 19:46

I’ll be using this for bioinformatics.

Joseph January 20, 2012 at 20:15

How fast is it, compared to the regular regexps?

Ivan Savov January 21, 2012 at 08:20

Very cool….

Could you give some more examples with words?

Why isn’t this working?

qm = tre.compile("(Quantum (Mechanics|Physics))" , tre.EXTENDED )

?

Ville Laurikari January 21, 2012 at 08:33

@Ivan, it does work for me; the regex compiles fine and matches either “Quantum Mechanics” or “Quantum Physics”. What’s the problem, exactly?

Ville Laurikari January 21, 2012 at 08:36

@Joseph, it’s a bit slower (by constant factors), but can still be quite fast. I have no benchmark results to show.

Ivan Savov January 21, 2012 at 09:26

Ok cool.

I forgot that search() and findall() aren’t the same thing ;)

Is there a way to add docs to a python C module?
I am lost without my ipython tools:

In [47]: qm.search?
Docstring: try to match against given string,
returning tre.match object or None on failure
In [48]: %pdef qm.search
No definition header found for qm.search

I am thinking of using re objects for “smart tags” that unite multiple versions of the tag into a compact expression that a RE-literate person can interpret.

Thx for your help.

Ville Laurikari January 21, 2012 at 09:56

The doc strings are defined in the C file, AFAIK not possible to edit without recompiling. Perhaps a small .py wrapper should be written around the raw, almost 1:1, mapping for a more pythonic interface. That file would also be a good home for doc strings.

Sefa June 11, 2012 at 21:53

Is there a way to find all (nonoverlapping) matches?

Nice library, thanks!

ai_maniac February 20, 2014 at 01:03

After following your instructions to install, I got this.
import tre
ImportError: libtre.so.5: cannot open shared object file: No such file or directory

Note: I use a fedora system and sudo does not work as expected, so I connected as superuser to install.
