Mirror of the gdb mailing list
 help / color / mirror / Atom feed
From: Jim Blandy <jimb@zwingli.cygnus.com>
To: Eli Zaretskii <eliz@is.elta.co.il>
Cc: dj@redhat.com, gcc@gcc.gnu.org, gdb@sources.redhat.com,
	binutils@sources.redhat.com, cygwin@sources.redhat.com
Subject: Re: Another RFC: regex in libiberty
Date: Mon, 11 Jun 2001 22:49:00 -0000	[thread overview]
Message-ID: <nplmmyz1qv.fsf@zwingli.cygnus.com> (raw)
In-Reply-To: <9003-Fri08Jun2001100651+0300-eliz@is.elta.co.il>

"Eli Zaretskii" <eliz@is.elta.co.il> writes:
> One notorious problem with GNU regex is that it is quite slow for many
> simple jobs, such as matching a simple regular expression with no
> backtracking.  It seems that the main reason for this slowness is the
> fact that GNU regex supports null characters in strings.  For
> examnple, Sed 3.02 compiled with GNU regex is about 2-4 times slower
> on simple jobs than the same Sed compiled with Spencer's regex
> library.  (The DJGPP port of Sed is actually distributed with two
> executables, one build with GNU regex, the other with Spencer's, for
> this very reason.)

I'm suspicious of this assertion.  I've worked on GNU regexp in the
past, and I don't see any reason this should be so.

However, I was messing around with regexps a lot when GNU regexp
suddenly became slow on certain regexps.  I looked into the cause, and
it turned out that this was because GNU regexp had been made to comply
with the POSIX regexp specification.  POSIX regexp semantics require
that the regexp match the longest possible string (I may have the
details wrong, but it's something like that).  For backtracking regexp
engines (the GNU, Henry Spencer, and Perl regexp matchers are all of
this design), that innocent-sounding constraint basically requires
insanely slow behavior.

GNU regexp has a flag that allows you to turn this behavior off, and
get the traditional, faster, non-POSIX-compliant behavior.  So I don't
see any reason the GNU regexp library couldn't serve all the GPL'd
software's needs.

----

The details, for the curious:

Suppose you have a regexp like this (assume the obvious
metacharacters)

  (FOObar|FOO)(barbar)+
   ---a-- -b-  ---c--      <= labels for pieces of the regexp

which you're matching against the string

  FOObarbarbarbar
  0  3  6  9  12

and let's suppose you're calling the regexp library in a manner which
asks "does a prefix of this string match this regexp?"  (That is,
you're not asking "does this regexp match this entire string?")

The traditional behavior is for the regexp engine to match part ---a--
of the regexp against data[0..5], match one repetition of part ---c--
against data[6..8], and say it's done.  The Perl regexp matcher will
return this match.

But this isn't the behavior POSIX requires.  POSIX says you must
return the *longest possible* match.  So a POSIX-compliant matcher
must match -b- against data[0..2], and then two repetitions of ---c--
against data[3..8] and data[9..14].  This is a longer match.

To find the longest match, in general, a backtracking matcher has to
generate every possible match, and return the longest one it found.
This is what GNU regexp does.

So, just how bad is this?  Well, suppose you're matching a regexp like:

        .*.*.*.*.*.*

against a string like

     aaaaaaaaaaaaaaaaaaaa

To generate every possible match, you have to choose every possible
way to divide up those twenty a's amongst six .* patterns.  I think
this is 20 choose 5, or 1.9 million, matches you have to try.  In
general, I think the time to match POSIXly can increase exponentially
in the length of your regexp, given a long enough data string.

If you have a smart regexp compiler (I understand Perl is pretty
clever), then you could probably handle this regexp with a bit more
aplomb.  But I'll bet that I can find a regexp where you really do
have to try all matchings, no matter how smart your regexp compiler
is.

(So think of that the next time you propose some innocent-sounding
constraint, like "longest match always"!)

Anyway, the outcome is that all the really popular regexp matchers
either don't implement the POSIX behavior, or provide options to turn
it off.  For GNU regexp, you can use the RE_NO_POSIX_BACKTRACKING
flag, and you'll get the traditional not-always-the-longest-match nice
fast behavior.  Perl simply documents the traditional behavior ("The
[Perl regexp] engine thinks locally and acts globally," as the Camel
book puts it).


  parent reply	other threads:[~2001-06-11 22:49 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <Daniel>
     [not found] ` <Vogel's>
     [not found]   ` <message>
     [not found]     ` <of>
     [not found]       ` <Mon,>
     [not found]         ` <01>
     [not found]           ` <Nov>
     [not found]             ` <1999>
     [not found]               ` <14:25:01>
     [not found]                 ` <+0100>
     [not found]                   ` <381D94AD.B37EC167@grafzahl.de>
1999-11-08  8:54                     ` go32-nat.c compilation problem Pierre Muller
     [not found]       ` <Fri,>
     [not found]         ` <08>
     [not found]           ` <Jun>
     [not found]             ` <2001>
     [not found]               ` <10:06:51>
     [not found]                 ` <+0300>
2001-06-07 18:27                   ` Another RFC: regex in libiberty DJ Delorie
2001-06-07 18:31                     ` Ian Lance Taylor
2001-06-07 18:33                       ` DJ Delorie
2001-06-07 18:43                         ` Ian Lance Taylor
2001-06-08  0:11                     ` Eli Zaretskii
2001-06-08  9:18                       ` Mark Mitchell
2001-06-08  9:59                       ` Zack Weinberg
2001-06-08 10:05                         ` H . J . Lu
2001-06-08 10:31                           ` Eli Zaretskii
2001-06-08 10:39                             ` H . J . Lu
2001-06-08 10:37                         ` Eli Zaretskii
2001-06-11 22:49                       ` Jim Blandy [this message]
2001-06-11 23:51                         ` Randall R Schulz
2001-06-12  6:48                         ` Jim Blandy
2001-06-08  1:15                     ` Pierre Muller
2001-06-08  1:36                       ` About struct bpp_transfer_params ±èµæÁß
2001-06-08  7:43                         ` Fernando Nasser
2001-06-09 13:34                     ` Another RFC: regex in libiberty Andrew Cagney
     [not found] <Eli>

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=nplmmyz1qv.fsf@zwingli.cygnus.com \
    --to=jimb@zwingli.cygnus.com \
    --cc=binutils@sources.redhat.com \
    --cc=cygwin@sources.redhat.com \
    --cc=dj@redhat.com \
    --cc=eliz@is.elta.co.il \
    --cc=gcc@gcc.gnu.org \
    --cc=gdb@sources.redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox