lucene-java-user mailing list archives

From "Stanislaw Osinski" <>
Subject Re: StandardTokenizer is slowing down highlighting a lot
Date Thu, 26 Jul 2007 06:53:45 GMT
On 25/07/07, Yonik Seeley <> wrote:
> On 7/25/07, Stanislaw Osinski <> wrote:
> > JavaCC is slow indeed.
> JavaCC is a very fast parser for a large document... the issue is
> small fields and JavaCC's use of an exception for flow control at the
> end of a value.  As JVMs have advanced, exception-as-control-flow has
> gotten comparatively slower.

In Carrot2 we tokenize mostly very short documents (search results), so in
this context JFlex proved much faster. I did a very rough performance test
of Highlighter using JFlex and JavaCC-generated analyzers with medium-sized
documents (up to ~1kB), and JFlex was still faster. What size would a
'large' document be?
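
To make the point about exceptions as flow control concrete, here is a rough
sketch of the two end-of-input styles. It is not the code JavaCC or JFlex
actually generates; the class and method names (EndOfInputSketch, EndOfInput,
ThrowingReader) are made up for illustration. The first tokenizer learns that
the input is finished by catching an exception, the second by an ordinary
bounds check. Either way the cost is paid once per field, which is why it
would weigh more on many short fields than on one large document.

```java
import java.util.ArrayList;
import java.util.List;

public class EndOfInputSketch {

    /** Hypothetical exception used only to signal "no more characters". */
    static final class EndOfInput extends RuntimeException {}

    /** Character source that throws when the input is exhausted. */
    static final class ThrowingReader {
        private final String text;
        private int pos = 0;

        ThrowingReader(String text) { this.text = text; }

        char next() {
            if (pos >= text.length()) {
                throw new EndOfInput(); // one exception per field
            }
            return text.charAt(pos++);
        }
    }

    /** Style 1: the end of the field is detected by catching an exception. */
    public static List<String> tokenizeWithException(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        ThrowingReader in = new ThrowingReader(text);
        try {
            while (true) {
                char c = in.next();
                if (Character.isLetterOrDigit(c)) {
                    current.append(c);
                } else {
                    flush(current, tokens);
                }
            }
        } catch (EndOfInput e) {
            flush(current, tokens); // emit the last pending token
        }
        return tokens;
    }

    /** Style 2: the end of the field is detected by a plain bounds check. */
    public static List<String> tokenizeWithBoundsCheck(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else {
                flush(current, tokens);
            }
        }
        flush(current, tokens);
        return tokens;
    }

    /** Moves the pending characters, if any, into the token list. */
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String field = "JFlex vs JavaCC";
        System.out.println(tokenizeWithException(field));
        System.out.println(tokenizeWithBoundsCheck(field));
    }
}
```

Both methods return the same tokens for the same input; the only difference
is how the end of the field is detected. Timing the two on a large batch of
short strings would be one way to see how much the exception costs on a
given JVM.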

> Does JFlex have a jar associated with it?  It's GPL (although you can
> freely use the files it generates under any license), so if there were
> other non-generated files required, we wouldn't be able to incorporate
> them.

You need the JFlex jar only to generate the tokenizer (one Java class). The
generated tokenizer is standalone and doesn't need the JFlex jar to run.

