lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer
Date Tue, 31 Jul 2007 19:04:53 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516775 ]

Michael McCandless commented on LUCENE-966:
-------------------------------------------


I agree, let's try to perfectly match the tokens of the old
StandardAnalyzer so we have a way-faster drop-in replacement.

The speedups of JFlex are amazing: based on a quick test, with JFlex +
patch from LUCENE-969, the new StandardAnalyzer is only 2.09X slower
than WhitespaceAnalyzer even though it's doing so much more ...

> Finally, when it comes to the initialization time of the new
> tokenizer -- according to the JFlex documentation, some time is
> required to unpack the transition tables. But the unpacking takes
> place during the initialization of static fields, so once the class
> is loaded the overhead should be negligible.

Yeah, I'm baffled why it's that much slower, but on 100-token docs I
definitely see LUCENE-969 making things 84% faster, versus "only" 36%
faster if I use the full Wikipedia docs (which are much larger than 100
tokens on average).  If we tested even smaller docs I think the gains
would be even larger.

When I ran under the profiler, StandardTokenizerImpl
<init>(java.io.Reader) was way at the top.  Maybe it's the cost of new'ing
the 16 KB buffer each time?

In any event I think it's OK, so long as we get LUCENE-969 in and
document the importance of using the reusableTokenStream() API for better
performance.
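
To make the reuse point concrete, here is a minimal sketch of why
per-document construction can dominate on small docs. It does not use any
Lucene classes: SketchTokenizer, its reset(Reader) method, and the
buffersAllocated counter are hypothetical stand-ins, and the buffer size is
an assumption loosely based on the "16 KB" figure above. It only shows
that constructing a tokenizer per document allocates one buffer per
document, while resetting a single instance allocates one buffer in total.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReuseSketch {

    /** Hypothetical stand-in for a generated scanner; NOT Lucene's StandardTokenizerImpl. */
    static final class SketchTokenizer {
        /** Counts how many read buffers have been allocated so far. */
        static int buffersAllocated = 0;

        private final char[] buffer;
        private Reader input;

        SketchTokenizer(Reader input) {
            // Assumed buffer size; this allocation happens on every construction.
            this.buffer = new char[16 * 1024];
            buffersAllocated++;
            this.input = input;
        }

        /** Points the existing instance (and its buffer) at a new document. */
        void reset(Reader input) {
            this.input = input;
        }

        /** Counts whitespace-separated tokens in the current input. */
        int countTokens() {
            int tokens = 0;
            boolean inToken = false;
            try {
                int n;
                while ((n = input.read(buffer, 0, buffer.length)) != -1) {
                    for (int i = 0; i < n; i++) {
                        boolean ws = Character.isWhitespace(buffer[i]);
                        if (!ws && !inToken) {
                            tokens++;
                        }
                        inToken = !ws;
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return tokens;
        }
    }

    /** One new tokenizer (and one new buffer) per document. */
    static int countWithFreshTokenizers(String[] docs) {
        int total = 0;
        for (String doc : docs) {
            total += new SketchTokenizer(new StringReader(doc)).countTokens();
        }
        return total;
    }

    /** One tokenizer created lazily, then reset for each later document. */
    static int countWithReusedTokenizer(String[] docs) {
        int total = 0;
        SketchTokenizer tokenizer = null;
        for (String doc : docs) {
            Reader reader = new StringReader(doc);
            if (tokenizer == null) {
                tokenizer = new SketchTokenizer(reader);
            } else {
                tokenizer.reset(reader);
            }
            total += tokenizer.countTokens();
        }
        return total;
    }

    public static void main(String[] args) {
        String[] docs = {"a faster analyzer", "small docs", "many of them"};

        SketchTokenizer.buffersAllocated = 0;
        int fresh = countWithFreshTokenizers(docs);
        System.out.println("fresh:  tokens=" + fresh
                + " buffers=" + SketchTokenizer.buffersAllocated);

        SketchTokenizer.buffersAllocated = 0;
        int reused = countWithReusedTokenizer(docs);
        System.out.println("reused: tokens=" + reused
                + " buffers=" + SketchTokenizer.buffersAllocated);
    }
}
```

The smaller the documents, the larger the share of total time a fixed
per-construction cost takes up, which would be consistent with the bigger
gain seen on 100-token docs than on full Wikipedia docs.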


> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>
>                 Key: LUCENE-966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-966
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>
>         Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt,
>                      jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several times) replacement
> for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

