jakarta-oro-dev mailing list archives

From "Daniel F. Savarese" <...@savarese.org>
Subject Re: Wanted: Regex[Input|Output]Stream
Date Thu, 04 Sep 2003 19:32:31 GMT

In message <NBBBJGEAGJAKLIDBKJOPOEJNDNAB.noel@devtech.com>,
"Noel J. Bergman" writes:
>An alternative would be to add an Observer (actually, wouldn't that be a
>Listener, to remain consistent with Java terminology? :-)) with the pattern,
>although it seems that none of the regex engines support compiling multiple
>patterns, which I find truly bizarre.  That would allow the Listener code
>to execute as each pattern is found in the stream.

There are some classes related to this in org.apache.oro.text that I've
never been particularly satisfied with.  MatchAction is basically
a listener/observer for MatchActionProcessor, which processes input
line by line awk-style and invokes registered MatchActions when their
respective patterns are matched.  The classes are too special-purpose
and are geared toward filtering operations, requiring an output stream
to be provided along with an input and supporting an awk-style field
separator.  However, the motivation for them is similar and this might
be a good opportunity to generalize their basis to accommodate the
use-case you have in mind.
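
To make the listener idea concrete, here is a minimal sketch of a
line-by-line dispatcher in the spirit of MatchActionProcessor.  It is
written against java.util.regex rather than ORO, and every name in it
(LineMatchDispatcher, LineMatchListener, register, process) is
hypothetical; it is not the org.apache.oro.text API.  Unlike the
existing classes, it takes no output stream and no field separator.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the org.apache.oro.text API.
final class LineMatchDispatcher {
    /** Callback invoked for each line in which the registered pattern occurs. */
    interface LineMatchListener {
        void matched(int lineNumber, String line, Matcher match);
    }

    private static final class Registration {
        final Pattern pattern;
        final LineMatchListener listener;

        Registration(Pattern pattern, LineMatchListener listener) {
            this.pattern = pattern;
            this.listener = listener;
        }
    }

    private final List<Registration> registrations = new ArrayList<>();

    /** Registers a listener for a pattern; several pairs may be registered. */
    void register(String regex, LineMatchListener listener) {
        registrations.add(new Registration(Pattern.compile(regex), listener));
    }

    /** Reads the input line by line and notifies listeners in registration order. */
    void process(Reader input) {
        BufferedReader reader = new BufferedReader(input);
        try {
            String line;
            int lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                for (Registration r : registrations) {
                    Matcher m = r.pattern.matcher(line);
                    if (m.find()) {
                        // Only the first occurrence in the line is reported.
                        r.listener.matched(lineNumber, line, m);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because matching is done per line, a match that spans a line break is
never seen, which is one reason this shape does not fit matching on a
continuous stream.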

Actually, on rereading your original message I think I didn't understand
what you were looking for.  You want to be able to read from
an input stream or write to an output stream as you normally would.
But transparent to this reading or writing, you want the data read or
the data written to be tested for pattern matches (on the continuous
stream of data) and be notified of these matches.  If I understand
that correctly, then that would be different from what I had
initially understood (although you can use a tee-like stream copier to
graft on AwkStreamInput).  In that case, the kicker is the problem
I mentioned in my last message about not being able to
definitively identify matches in a stream without reading (and buffering)
the entire stream.  This is not a problem with strictly DFA matching
as per AwkMatcher, but is a problem for Perl-type matching.  If you are
willing to live with an Expect-like compromise of limited buffering or
my suggestion of specifying a bound on the length of a match, this is
tractable and perhaps an appropriate addition to org.apache.oro.io
(at the same time we can take the opportunity to implement the matcher
factories we've been putting off so we can wrap jakarta-regexp and
java.util.regex or whatever and you can use whatever regular expression
package you want with the class).
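
A rough sketch of the bounded-match-length compromise follows.  It is
only an illustration under stated assumptions: it uses java.util.regex
instead of an ORO matcher, the names MatchNotifyingReader,
MatchListener and maxMatchLength are hypothetical, and it trusts the
caller that no match is longer than maxMatchLength.  Data passes
through unchanged; a match is reported only once enough following data
has been seen (or the stream has ended) that it can no longer change.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; nothing like this exists in org.apache.oro.io.
final class MatchNotifyingReader extends Reader {
    /** Receives each confirmed match and its character offset in the stream. */
    interface MatchListener {
        void matchFound(long offset, String text);
    }

    private final Reader in;
    private final Pattern pattern;
    private final int maxMatchLength; // assumed upper bound on any match length
    private final MatchListener listener;
    private final StringBuilder window = new StringBuilder();
    private long windowOffset = 0;    // stream offset of window.charAt(0)
    private boolean eof = false;

    MatchNotifyingReader(Reader in, Pattern pattern, int maxMatchLength,
                         MatchListener listener) {
        if (maxMatchLength < 1) {
            throw new IllegalArgumentException("maxMatchLength must be >= 1");
        }
        this.in = in;
        this.pattern = pattern;
        this.maxMatchLength = maxMatchLength;
        this.listener = listener;
    }

    // Reader.read() and Reader.skip() are implemented on top of this method,
    // so all data flowing through the reader is observed here.
    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) {
            window.append(buf, off, n);
            scan();
        } else if (n < 0 && !eof) {
            eof = true;
            scan(); // flush matches that were waiting for more data
        }
        return n;
    }

    private void scan() {
        Matcher m = pattern.matcher(window);
        int consumed = 0;
        while (m.find()) {
            // Before end of stream, a match that starts within maxMatchLength
            // of the window's end could still change; wait for more data.
            if (!eof && m.start() + maxMatchLength > window.length()) {
                break;
            }
            if (m.end() > m.start()) { // empty matches are not reported
                listener.matchFound(windowOffset + m.start(), m.group());
            }
            consumed = m.end();
        }
        // Keep only what a pending or future match could still need.
        int keepFrom = eof
                ? window.length()
                : Math.max(consumed, window.length() - maxMatchLength + 1);
        window.delete(0, keepFrom);
        windowOffset += keepFrom;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

Memory use stays bounded by maxMatchLength plus the size of one read.
The price is that anchors and lookbehind see the trimmed window rather
than the real stream, and every scan restarts matching on the buffered
data.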

I think the interface will be some variation of what Leo offered
(with whatever additional tweaks may arise; e.g., you have to be able
to specify the patterns to be matched, maybe you want to register
multiple listeners instead of just one), but the internals of an
implementation are dependent on the limitations I mentioned.  There's
a quick and dirty way to get this done, but I believe later fine-tuning
for performance or soundness of overall design may require whatever regex
matcher is used to support incremental matching.  For example, if you're
in the middle of a match but not enough data has been written/read yet to
determine whether a match exists, you want to be able to continue the
matching process from where it left off rather than restarting from
scratch.  Preserving matching progress is definitely doable for DFAs,
but NFAs may just have to start over again.  In practice, it may not
make a big difference, which is why I would go the quick and dirty way
first.
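
For what the quick and dirty way might look like with a backtracking
matcher, here is a small sketch using java.util.regex (an assumption on
my part; it says nothing about what the ORO matchers expose).  That
package cannot resume a match, but Matcher.hitEnd() reports whether the
last attempt ran into the end of the input, i.e. whether more data
could change the outcome.  The names PartialMatchProbe, Status and
probe are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: classifies an attempt to match at the start of the
// data buffered so far. Each call restarts matching from scratch.
final class PartialMatchProbe {
    enum Status { MATCH, NEED_MORE_INPUT, NO_MATCH }

    static Status probe(Pattern pattern, CharSequence buffered) {
        Matcher m = pattern.matcher(buffered);
        boolean found = m.lookingAt(); // anchored at the start of the buffer
        if (m.hitEnd()) {
            // The matcher ran out of input: either a partial match that may
            // complete, or a match that may still grow. At end of stream a
            // caller would accept the result of lookingAt() as final.
            return Status.NEED_MORE_INPUT;
        }
        return found ? Status.MATCH : Status.NO_MATCH;
    }

    private PartialMatchProbe() {}
}
```

A caller would append newly read data to its buffer and call probe
again on NEED_MORE_INPUT, which is exactly the start-over behaviour
described above; a matcher with true incremental support would instead
carry its state across calls.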

daniel


