jakarta-oro-dev mailing list archives

From "Daniel F. Savarese" <...@savarese.org>
Subject Re: Wanted: Regex[Input|Output]Stream
Date Thu, 04 Sep 2003 15:45:40 GMT

In message <NBBBJGEAGJAKLIDBKJOPKEGLDNAB.noel@devtech.com>, "Noel J. Bergman" writes:
>Does anyone know of an implementation of the titular classes?
>
>Basically, to compile multiple regex patterns into a stream, and then as I
>read through the stream, the data should be optimally checked.  This is a
>classic FSA situation.  If a match occurs, I think that I'd like to receive
>an exception at that point in the stream, but the data would still be valid,
>and I can continue processing.

The org.apache.oro.text.awk package will perform matches on streams
(see http://jakarta.apache.org/oro/api/org/apache/oro/text/awk/AwkMatcher.html#contains(org.apache.oro.text.awk.AwkStreamInput,%20org.apache.oro.text.regex.Pattern)),
return offset information, and allow continued processing.  It doesn't
throw an exception though and it only searches for a single pattern, so
you have to use alternation to match multiple patterns (which, unlike
Perl, does not require backtracking with Awk patterns).  The downside
is that the Awk package is limited to 8-bit input to keep the DFA tables
it builds (in lazy fashion) small.  The functionality was removed from the
jakarta-oro Perl classes on account of:

"On the Use of Regular Expressions for Searching Text", Clarke and Cormack,
  ACM Transactions on Programming Languages and Systems, Vol 19, No. 3,
  pp 413-426.

In short, the behavior of Perl regexes is such that in most cases you can't
determine a match in a stream without reading the entire stream.  Tcl
regexes have the same problem and Expect used to (still does?) handle the
problem by looking for a match in a lookahead buffer (2000 characters
sounds familiar) and giving up if not found, so you could lose matches
if boundary conditions came into play (e.g., a match ran off the end of
the buffer).  If you can bound the length of a match, then it's a problem
that can be solved efficiently.  It may be worth reintroducing the
functionality as a standard part of jakarta-oro, but requiring the
programmer to specify a limit on the size of a match.
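
A rough sketch of that bounded-match idea, using java.util.regex rather
than anything in jakarta-oro (this is not an ORO API; the class name
BoundedStreamSearch, the method find, and the parameter maxLen are made up
for illustration).  It assumes that no match is longer than maxLen
characters and that the pattern uses no anchors or lookaround and does not
match the empty string.  Under those assumptions, a match attempt that
starts at least maxLen characters before the end of the buffered data
cannot be changed by anything read later, so it can be reported and the
data before it discarded.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedStreamSearch {
    /**
     * Returns {start, end} stream offsets of each match found in the stream.
     * ASSUMPTION: no match is longer than maxLen characters, and the pattern
     * has no anchors or lookaround. Memory use stays near 2 * maxLen plus
     * one read chunk, however long the stream is.
     */
    static List<long[]> find(Reader in, Pattern p, int maxLen) throws IOException {
        if (maxLen < 1) {
            throw new IllegalArgumentException("maxLen must be at least 1");
        }
        List<long[]> hits = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        char[] chunk = new char[Math.max(maxLen, 1024)];
        long base = 0;        // stream offset of buf.charAt(0)
        boolean eof = false;

        while (!eof || buf.length() > 0) {
            // Keep at least 2 * maxLen characters buffered, unless the stream ends.
            while (!eof && buf.length() < 2 * maxLen) {
                int n = in.read(chunk);
                if (n < 0) {
                    eof = true;
                } else {
                    buf.append(chunk, 0, n);
                }
            }

            // Attempts starting before 'safe' have maxLen characters of data
            // after them, so their outcome is final under the assumption above.
            int safe = eof ? buf.length() : buf.length() - maxLen;
            Matcher m = p.matcher(buf);
            int consumed = 0;
            while (m.find(consumed)) {
                if (!eof && m.start() >= safe) {
                    break;    // too close to the buffer end; retry with more data
                }
                hits.add(new long[] { base + m.start(), base + m.end() });
                // Step past the match; step one extra on an empty match.
                consumed = (m.end() > m.start()) ? m.end() : m.end() + 1;
                if (consumed > buf.length()) {
                    break;
                }
            }

            // Drop everything that has been fully decided.
            int drop = Math.min(Math.max(consumed, safe), buf.length());
            buf.delete(0, drop);
            base += drop;
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        Pattern p = Pattern.compile("foo|bar");   // alternation for multiple patterns
        for (long[] hit : find(new StringReader("xxfooyybarzz"), p, 3)) {
            System.out.println("match at [" + hit[0] + ", " + hit[1] + ")");
        }
    }
}
```

If the limit is set too low, a longer match can be truncated or missed,
which is the same boundary problem as the Expect lookahead buffer; the
difference is only that the programmer picks the limit and knows about it.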

daniel
