One night on a mush far, far, away, Chris Siebenmann and I were noticing that certain regular expressions performed bizarrely; that is, what was slow was not what was expected. (specifically, we noticed that in python re.compile('xx*') performed much, much better than the equivalent re.compile('x+'))
Since Chris keeps a regular blog - I just have a wimpy little livejournal account that I rarely update - he wrote an entry about this and then investigated some more and wrote another entry about it. This page summarizes the results of that second entry, and includes a pretty table so that everyone can see what I mean.
In general, perl's regular expression performance beats that of the other languages we compared it to (python and ruby). When the regular expression is a fixed string, and that fixed string is present in the text being searched, perl's performance jumps into "phenomenal" territory.
However, there is one spot where perl's performance falls down, enough so that it is over 30 times as slow as ruby's engine, and that is on regular expression alternates (i.e. /FOO|BAR/). This test focused on various ways to express the regular expression /FOO BAR BAZ|FOO MORK MINK/. As such, this is deliberately poking perl where it hurts, and the results reflect that.
This gives the tests used, in perlish notation:
|prefix alternate||/FOO (BAR BAZ|MORK MINK)/|
|plain alternate||/(FOO BAR BAZ|FOO MORK MINK)/|
|two regexps||/FOO BAR BAZ/ or /FOO MORK MINK/|
|FOO+ alternate||/FOO+ (BAR BAZ|MORK MINK)/|
|F+OO alternate||/F+OO (BAR BAZ|MORK MINK)/|
|la alternate||/(?=FOO)(FOO BAR BAZ|FOO MORK MINK)/|
|plain FOO BAR BAZ||/FOO BAR BAZ/|
These regular expressions were tested against sample text that contained 22k characters of text with no "F", and a match or match-precursor at varying locations. "early" below means that the match-precursor mentioned occurred 30% of the way in, "middle" that the match precursor occurred 50% of the way in, "late" that the match precursor occurred 70% of the way in, and "end" that the match-precursor occurred after the 22k characters of non-matching text.
This test was performed using stock cygwin installs of perl 5.8.7, python 2.4.1, and ruby 1.8.4 on a 1.2 GHz laptop with 512MB of installed physical memory. Just for fun, I also compiled perl 5.9.3 on cygwin with defaults (except with -O3, since the stock perl had that) and tested that too. In a further fit of insanity, on 2006-03-01 I grabbed the latest bleadperl, (compiled as 5.9.3) and tested that too.
The following table summarizes the results; the times listed are in µ-seconds, and represent the average time to perform the equivalent of $txt =~ /$re/ in each language. (The average was taken over 2000 tests for python and perl, and 8000 tests for ruby)
|Text type regular expression||two regexps||plain alternate||la alternate||prefix alternate||FOO+ alternate||F+OO alternate||plain FOO BAR BAZ|
|early FOO BAR||perl||25.94||5435.00||3420.00||44.93||30.50||62.48||12.89|
|early FOO BAR BAZ||perl||4.30||1607.00||1056.00||24.40||17.79||34.81||4.93|
|early FOO MORK MINK||perl||17.86||1607.00||1021.00||22.34||17.38||33.64||14.38|
|end FOO BAR||perl||24.65||5264.00||3410.00||44.96||30.86||83.22||13.29|
|end FOO BAR BAZ||perl||13.09||5327.00||3410.00||51.08||41.35||93.83||13.51|
|end FOO MORK MINK||perl||25.56||5552.00||3455.00||50.32||45.09||101.00||14.54|
One very clear consequence of this is that if you have an application in perl that depends on finding one out of many regular expressions in a piece of text, it is much more efficient to test for each regular expression separately than to use an alternate regular expression.
Also, in all languages tested, there is a clear advantage to using regular expressions which begin with constant strings rather than using other equivalent constructs. (Compare the F+OO column to the FOO+ column) Specifically, any intial construct of the form x+ should probably be written as xx*; this especially holds for python.
Finally, this test points to some cases where the different regular expression engine implementors could learn from each other.
If any Common Lisp people out there want to contribute a version of the benchmark using that, I'll try to figure out how to get clisp and cl-pcre installed here and include it. Likewise C++ people using the pcre library.
As a subjective observation, note that I (as someone who writes a fair amount of perl regularly; of these three languages, perl is the one I know best) found it much easier to do the translation of the benchmark into ruby from the original python than to do the translation into perl from python. This suggests that python programmers who are looking at switching languages for a particular program because of regular expression performance should be aware that they will probably find conversion to ruby easier to do, assuming that ruby's regular expression engine meets their performance needs.