Regular expression speed comparison

Background

One night on a mush far, far, away, Chris Siebenmann and I were noticing that certain regular expressions performed bizarrely; that is, what was slow was not what was expected. (specifically, we noticed that in python re.compile('xx*') performed much, much better than the equivalent re.compile('x+'))

Since Chris keeps a regular blog - I just have a wimpy little livejournal account that I rarely update - he wrote an entry about this and then investigated some more and wrote another entry about it. This page summarizes the results of that second entry, and includes a pretty table so that everyone can see what I mean.

Summary of results

In general, perl's regular expression performance beats that of the other languages we compared it to (python and ruby). When the regular expression is a fixed string, and that fixed string is present in the text being searched, perl's performance jumps into "phenomenal" territory.

However, there is one spot where perl's performance falls down, enough so that it is over 30 times as slow as ruby's engine, and that is on regular expression alternates (i.e. /FOO|BAR/). This test focused on various ways to express the regular expression /FOO BAR BAZ|FOO MORK MINK/. As such, this is deliberately poking perl where it hurts, and the results reflect that.

Test setup

This gives the tests used, in perlish notation:

Test namecode
prefix alternate/FOO (BAR BAZ|MORK MINK)/
plain alternate/(FOO BAR BAZ|FOO MORK MINK)/
two regexps/FOO BAR BAZ/ or /FOO MORK MINK/
FOO+ alternate/FOO+ (BAR BAZ|MORK MINK)/
F+OO alternate/F+OO (BAR BAZ|MORK MINK)/
la alternate/(?=FOO)(FOO BAR BAZ|FOO MORK MINK)/
plain FOO BAR BAZ/FOO BAR BAZ/

These regular expressions were tested against sample text that contained 22k characters of text with no "F", and a match or match-precursor at varying locations. "early" below means that the match-precursor mentioned occurred 30% of the way in, "middle" that the match precursor occurred 50% of the way in, "late" that the match precursor occurred 70% of the way in, and "end" that the match-precursor occurred after the 22k characters of non-matching text.

This test was performed using stock cygwin installs of perl 5.8.7, python 2.4.1, and ruby 1.8.4 on a 1.2 GHz laptop with 512MB of installed physical memory. Just for fun, I also compiled perl 5.9.3 on cygwin with defaults (except with -O3, since the stock perl had that) and tested that too. In a further fit of insanity, on 2006-03-01 I grabbed the latest bleadperl, (compiled as 5.9.3) and tested that too.

Results

The following table summarizes the results; the times listed are in µ-seconds, and represent the average time to perform the equivalent of $txt =~ /$re/ in each language. (The average was taken over 2000 tests for python and perl, and 8000 tests for ruby)

Text type regular expressiontwo regexpsplain alternatela alternateprefix alternateFOO+ alternateF+OO alternateplain FOO BAR BAZ
plain unmatchingperl 25.08 5462.00 3410.00 42.89 30.15 41.97 16.55
python 174.00 489.00 1171.00 80.00 86.50 1359.00 82.50
ruby 40.12 135.60 64.13 74.38 96.63 96.13 22.00
perl593 25.15 4816.00 5097.00 42.87 32.16 43.49 13.64
bleadperl 26.17 4810.00 5132.00 43.73 31.56 46.20 18.22
early FOO BARperl 25.94 5435.00 3420.00 44.93 30.50 62.48 12.89
python 162.50 461.00 1157.00 82.00 83.50 1311.00 83.50
ruby 39.62 135.70 154.20 120.80 128.90 162.50 22.50
perl593 25.99 4878.00 5017.00 46.12 31.57 63.95 13.63
bleadperl 26.63 4740.00 4937.00 47.70 36.28 64.04 14.14
early FOO BAR BAZperl 4.30 1607.00 1056.00 24.40 17.79 34.81 4.93
python 26.50 141.00 368.00 27.00 29.50 404.50 24.50
ruby 15.62 46.88 70.13 29.37 35.00 76.38 13.75
perl593 4.37 1497.00 1547.00 22.23 17.34 33.19 4.35
bleadperl 3.95 1386.00 1467.00 19.72 17.17 35.79 4.10
early FOO MORK MINKperl 17.86 1607.00 1021.00 22.34 17.38 33.64 14.38
python 103.50 139.00 352.00 26.00 26.50 391.00 81.50
ruby 35.37 47.75 67.38 29.13 37.38 80.88 22.12
perl593 18.00 1537.00 1482.00 21.67 16.79 34.34 14.13
bleadperl 17.71 1397.00 1467.00 20.01 17.10 32.63 14.20
end FOO BARperl 24.65 5264.00 3410.00 44.96 30.86 83.22 13.29
python 160.00 450.50 1174.00 82.00 86.50 1331.00 85.00
ruby 42.75 144.10 202.60 82.50 96.88 237.90 21.50
perl593 26.53 4943.00 5242.00 44.92 31.66 97.47 13.84
bleadperl 26.99 4665.00 4851.00 44.62 32.03 86.62 15.15
end FOO BAR BAZperl 13.09 5327.00 3410.00 51.08 41.35 93.83 13.51
python 81.00 450.00 1167.00 81.50 81.50 1345.00 79.00
ruby 29.00 144.90 201.50 81.62 113.50 240.50 28.25
perl593 16.83 4807.00 4952.00 49.29 39.05 93.98 14.35
bleadperl 14.08 4664.00 5052.00 55.22 45.36 92.97 14.44
end FOO MORK MINKperl 25.56 5552.00 3455.00 50.32 45.09 101.00 14.54
python 162.50 443.00 1174.00 81.50 84.00 1399.00 80.50
ruby 47.13 138.10 204.80 84.50 100.30 237.70 22.50
perl593 26.24 4839.00 4962.00 49.05 38.70 92.29 13.76
bleadperl 28.47 4770.00 4947.00 53.37 38.42 101.20 14.01

Consequences

One very clear consequence of this is that if you have an application in perl that depends on finding one out of many regular expressions in a piece of text, it is much more efficient to test for each regular expression separately than to use an alternate regular expression.

Also, in all languages tested, there is a clear advantage to using regular expressions which begin with constant strings rather than using other equivalent constructs. (Compare the F+OO column to the FOO+ column) Specifically, any intial construct of the form x+ should probably be written as xx*; this especially holds for python.

Finally, this test points to some cases where the different regular expression engine implementors could learn from each other.

If any Common Lisp people out there want to contribute a version of the benchmark using that, I'll try to figure out how to get clisp and cl-pcre installed here and include it. Likewise C++ people using the pcre library.

As a subjective observation, note that I (as someone who writes a fair amount of perl regularly; of these three languages, perl is the one I know best) found it much easier to do the translation of the benchmark into ruby from the original python than to do the translation into perl from python. This suggests that python programmers who are looking at switching languages for a particular program because of regular expression performance should be aware that they will probably find conversion to ruby easier to do, assuming that ruby's regular expression engine meets their performance needs.

Code

Perl benchmark code (original location)
The perl version of the benchmark
Python benchmark code (original location)
The python version of the benchmark
Ruby benchmark code
The ruby version of the benchmark
HTML report template
The HTML template from which this page is generated
out2html.pl
The perl script which turns a bunch of test output into a pretty html table