lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer
Date Tue, 31 Jul 2007 17:13:52 GMT


Michael McCandless commented on LUCENE-966:

I took the patch from here (to use jflex for StandardAnalyzer) and
merged it with the patch from LUCENE-969 (re-use Token & TokenStream)
to measure the net performance gains.

I measure the time to just tokenize all of Wikipedia using
StandardAnalyzer using contrib/benchmark plus patch from LUCENE-967
(test details are described in LUCENE-969).

With the jflex patch it takes 646 sec (best of 2 runs); when I then
merge in the patch from LUCENE-969 it takes 455 sec.  Subtracting off
the time to just load all Wikipedia docs (= 112 sec) that gives net
additional speedup of 36% (534 sec -> 343 sec) when using LUCENE-969
in addition to jflex.

A couple other things I noticed:

  * The init cost of jflex (StandardTokenizerImpl) seems to be fairly
    high: when I repeat the above test with smallish docs (100 tokens
    each) instead, the gain is around 84%.  I think this just makes
    the new reusableTokenStream() in LUCENE-969 important to commit.

  * I'm seeing differing token counts with the jflex StandardAnalyzer
    vs the current one; I think there is some difference here.  I will
    track down which tokens differ and post back...

> A faster JFlex-based replacement for StandardAnalyzer
> -----------------------------------------------------
>                 Key: LUCENE-966
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Stanislaw Osinski
>             Fix For: 2.3
>         Attachments:, jflex-analyzer-patch.txt, jflex-analyzer-r560135-patch.txt,
> JFlex ( can be used to generate a faster (up to several times) replacement
for StandardAnalyzer. Will add a patch and a simple benchmark code in a while.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message