jakarta-oro-dev mailing list archives

From "Daniel F. Savarese" <...@savarese.org>
Subject Re: ORO Performance
Date Mon, 18 Nov 2002 23:14:56 GMT

In message <003701c28c47$66018ee0$0600a8c0@brutesquadlabs.com>, "Bob Dickinson (BSL)" writes:
><tolerance-plea>
>Upon completing this message, it occurs to me that it is REALLY long and
>covers a lot of ground.  If the size of this message intimidates or annoys
>you (as I imagine it might me), then just consider that this is a good three
>or four weeks of stuff that I could have been sending one or two paragraphs

I guess you resolved most of this with your last email, but I'll comment
on a couple of things.  The Awk stuff does use a lazy DFA builder to
minimize memory consumption (it only builds those states that are actually
visited).  However, the actual DFA structure is probably not the best.
Regarding 8 bits: I don't know how to build efficient DFA matchers
for Unicode.  I learned how to build parsers and lexers over 10 years ago,
when using more than 8 bits to represent a character was impractical and
would balloon your memory use (I think I was using a 386SX with 1 MB of
RAM at the time).  I've not taken the time to study memory-conserving
DFA implementation techniques (if there are any) for a complete 16-bit
character set.
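
To make the lazy DFA idea concrete, here's a rough sketch of lazy subset
construction.  The NFA representation and class names are made up for
illustration and are much simpler than what the Awk classes actually do
(no epsilon moves, no character classes):

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Lazy subset construction: a DFA state is a set of NFA states, and its
// transition row is built (and cached) only the first time input reaches it,
// so states the input never visits cost no memory.
public class LazyDfaSketch {

    // NFA transitions: state -> (character -> possible next states).
    private final Map<Integer, Map<Character, Set<Integer>>> nfa;
    private final Set<Integer> accepting;

    // Cache of DFA transition rows, keyed by the set of NFA states.
    private final Map<Set<Integer>, Map<Character, Set<Integer>>> dfaRows =
            new HashMap<>();

    public LazyDfaSketch(Map<Integer, Map<Character, Set<Integer>>> nfa,
                         Set<Integer> accepting) {
        this.nfa = nfa;
        this.accepting = accepting;
    }

    public boolean matches(String input, int startState) {
        Set<Integer> current = Collections.singleton(startState);
        for (int i = 0; i < input.length(); i++) {
            Character c = input.charAt(i);

            Map<Character, Set<Integer>> row = dfaRows.get(current);
            if (row == null) {                      // first visit to this DFA state
                row = new HashMap<>();
                dfaRows.put(current, row);
            }

            Set<Integer> next = row.get(c);
            if (next == null) {                     // first time this character is seen here
                next = new HashSet<>();
                for (Integer s : current) {
                    Map<Character, Set<Integer>> out = nfa.get(s);
                    if (out != null && out.get(c) != null) {
                        next.addAll(out.get(c));
                    }
                }
                row.put(c, next);                   // cached for every later match
            }

            if (next.isEmpty()) {
                return false;                       // dead state: no NFA state survives
            }
            current = next;
        }
        for (Integer s : current) {
            if (accepting.contains(s)) {
                return true;
            }
        }
        return false;
    }
}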

You've probably already figured this out by now after looking at Perl5Matcher,
but there's no use in trying to convert it to a DFA or hybrid NFA/DFA
implementation (I'm 99.99% sure it's not possible to implement Perl5 regexes
as a pure DFA anyway).  It would be better to start a whole new engine,
but as you've already indicated, that takes time.  But, as you've also
already determined, there are opportunities to make Perl5Matcher faster.
I was about to say that the NFA produced by Perl5Compiler could not be
converted into a DFA because of its representation, but that's not
necessarily true.  It just wouldn't be easy since it's not table based.
I have thought about the possibility of compiling the interpreted
opcodes into Java bytecode, but I've never explored it.  It should
be doable and very fast, but I don't know about the memory footprint,
since you'd be creating a new class for each pattern.
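
For the bytecode idea, this is roughly the kind of class I'd expect a
pattern compiler to spit out for something like "ab+c" (hand-written here
instead of generated, and the interface name is invented for the sketch):

// Instead of an interpreter walking opcodes for every match, each pattern
// becomes its own class with the matching loop baked in.  A real version
// would emit the class below as bytecode rather than as source.
interface CompiledPattern {
    boolean matches(CharSequence input);
}

// What generated code for the pattern "ab+c" (anchored, full match) might
// look like: straight-line character tests, no opcode dispatch at all.
final class Pattern_ab_plus_c implements CompiledPattern {
    public boolean matches(CharSequence input) {
        int i = 0;
        int n = input.length();
        if (i >= n || input.charAt(i) != 'a') return false;  // literal 'a'
        i++;
        if (i >= n || input.charAt(i) != 'b') return false;  // b+ needs at least one 'b'
        i++;
        while (i < n && input.charAt(i) == 'b') i++;         // greedy: eat remaining 'b's
        if (i >= n || input.charAt(i) != 'c') return false;  // literal 'c'
        i++;
        return i == n;                                       // whole input consumed
    }
}

The trade-off is the one I mentioned: every compiled pattern becomes a
loaded class, so a program that compiles thousands of patterns pays for it
in class storage rather than plain heap objects.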

Alternations are slow because they require backtracking and recursion.
There's probably a way to improve this, but alternations have always
been a problem.  If you don't need to save groups, don't save them,
but you already know that.  Unnecessarily saving groups will cause 
extra memory allocation, although we can probably reduce that a bit 
the same way we can with Perl5Repetition (see __pushState() and
__popState()).
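
As a trivial illustration of the group-saving point (the calls are the ORO
API as I remember it, so double-check against the javadoc):

import org.apache.oro.text.regex.MalformedPatternException;
import org.apache.oro.text.regex.Pattern;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.Perl5Matcher;

public class GroupSavingExample {
    public static void main(String[] args) throws MalformedPatternException {
        Perl5Compiler compiler = new Perl5Compiler();
        Perl5Matcher matcher = new Perl5Matcher();

        // Capturing alternation: the matcher has to record group boundaries,
        // which means extra state (and allocation) on every match attempt.
        Pattern capturing = compiler.compile("(foo|bar|baz)=\\d+");

        // Non-capturing alternation: matches the same strings, nothing to save.
        Pattern nonCapturing = compiler.compile("(?:foo|bar|baz)=\\d+");

        String input = "option bar=42 set";
        System.out.println(matcher.contains(input, capturing));     // true
        System.out.println(matcher.contains(input, nonCapturing));  // true
    }
}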

My last thought is that if I were you, I'd crack out a C-based regex
implementation (one that actually delivered on your performance requirements)
and call it from Java with JNI.  I'm sure there's a good reason why
you're not doing this.  In general, I shun Java as an implementation
medium for large-scale text processing, although this may have changed
with java.nio.  Still, there are lots of reasons why we use Java for
things it's not very good at.
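
The Java side of the JNI route could be as thin as something like this
(all the names are made up; the C side would wrap whatever engine actually
meets the performance bar, with the header for the stubs generated by javah):

// Thin Java front end for a native (C) regex engine.
public class NativeMatcher {
    static {
        // Expects libnativematcher.so (or the platform equivalent)
        // on java.library.path.
        System.loadLibrary("nativematcher");
    }

    // Compile the pattern on the C side; the long is an opaque handle
    // (e.g. a pointer) to the compiled pattern.
    public native long compile(String pattern);

    // Run the native engine against the input.
    public native boolean matches(long patternHandle, String input);

    // Release the native allocation; the Java GC can't do it for us.
    public native void free(long patternHandle);
}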

daniel




