lucene-java-user mailing list archives

From "Stanislaw Osinski" <stanislaw.osin...@man.poznan.pl>
Subject Re: StandardTokenizer is slowing down highlighting a lot
Date Thu, 26 Jul 2007 06:53:45 GMT
On 25/07/07, Yonik Seeley <yonik@apache.org> wrote:
>
> On 7/25/07, Stanislaw Osinski <stanislaw.osinski@man.poznan.pl> wrote:
> > JavaCC is slow indeed.
>
> JavaCC is a very fast parser for a large document... the issue is
> small fields and JavaCC's use of an exception for flow control at the
> end of a value.  As JVMs have advanced, exception-as-control-flow has
> gotten comparatively slower.


In Carrot2 we tokenize mostly very short documents (search results), so in
this context JFlex proved much faster. I did a very rough performance test
of Highlighter using JFlex- and JavaCC-generated analyzers with medium-sized
documents (up to ~1kB), and JFlex was still faster. What size would a
'large' document be?
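
To make the exception-at-end-of-input point concrete, here is a minimal
sketch. It is not the code JavaCC or JFlex actually generates; the class
and method names (EofDemo, ThrowingSource, countTokensWith...) are
hypothetical and exist only for illustration. Both methods count runs of
letters and digits; one learns about end of input from an exception, the
other from the -1 that Reader.read() returns. The timing loop in main is a
rough comparison on one short field, not a proper benchmark.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class EofDemo {
    // Hypothetical character source that signals end of input by throwing.
    static final class ThrowingSource {
        private final Reader in;
        ThrowingSource(String s) { in = new StringReader(s); }
        char next() throws IOException {
            int c = in.read();
            if (c == -1) throw new IOException("end of input");
            return (char) c;
        }
    }

    // Counts tokens; the loop only ends when the source throws.
    static int countTokensWithException(String text) {
        ThrowingSource src = new ThrowingSource(text);
        int tokens = 0;
        boolean inToken = false;
        try {
            while (true) {
                boolean wordChar = Character.isLetterOrDigit(src.next());
                if (wordChar && !inToken) tokens++;
                inToken = wordChar;
            }
        } catch (IOException endOfInput) {
            // Expected: one exception is created and thrown per field.
        }
        return tokens;
    }

    // Counts tokens; the loop ends on the -1 sentinel from Reader.read().
    static int countTokensWithSentinel(String text) {
        Reader in = new StringReader(text);
        int tokens = 0;
        boolean inToken = false;
        try {
            int c;
            while ((c = in.read()) != -1) {
                boolean wordChar = Character.isLetterOrDigit((char) c);
                if (wordChar && !inToken) tokens++;
                inToken = wordChar;
            }
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for a StringReader
        }
        return tokens;
    }

    public static void main(String[] args) {
        String field = "JFlex proved much faster"; // a short field, as in search results
        int rounds = 200_000;
        long sum = 0;

        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) sum += countTokensWithException(field);
        long t1 = System.nanoTime();
        for (int i = 0; i < rounds; i++) sum += countTokensWithSentinel(field);
        long t2 = System.nanoTime();

        System.out.println("exception: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("sentinel:  " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("tokens counted: " + sum);
    }
}
```

The cost of the exception is paid once per field, so it matters less the
longer the field is and more the shorter it is.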

> Does JFlex have a jar associated with it?  It's GPL (although you can
> freely use the files it generates under any license), so if there were
> other non-generated files required, we wouldn't be able to incorporate
> them.


You need the JFlex jar only to generate the tokenizer (one Java class). The
generated tokenizer is standalone and doesn't need the JFlex jar to run.

Staszek
