lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer
Date Tue, 31 Jul 2007 19:04:53 GMT


Michael McCandless commented on LUCENE-966:

I agree, let's try to perfectly match the tokens of the old
StandardAnalyzer so we have a way-faster drop-in replacement.

The speedups of JFlex are amazing: based on a quick test, with JFlex +
patch from LUCENE-969, the new StandardAnalyzer is only 2.09X slower
than WhitespaceAnalyzer even though it's doing so much more ...

> Finally, when it comes to the initialization time of the new
> tokenizer -- according to the JFlex documentation, some time is
> required to unpack the transition tables. But the unpacking takes
> place during the initialization of static fields, so once the class
> is loaded the overhead should be negligible.

Yeah I'm baffled why it's that much slower, but on 100 token docs I
definitely see LUCENE-969 making things 84% faster but "only" 36%
faster if I use the full Wikipedia doc (which are much larger than 100
tokens on average).  If we tested even smaller docs I think the gains
would be even more.

When I ran under the profiler it was the StandardTokenizerImpl
<init>( way on the top.  Maybe it's the cost of new'ing
the 16 KB buffer each time?

In any event I think it's OK, so long as we get LUCENE-969 in, and
document the importance of using reusableTokenStream() API for better

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>                 Key: LUCENE-966
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>         Attachments:, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt,
> JFlex ( can be used to generate a faster (up to several times) replacement
for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message