lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Created: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText
Date Mon, 30 Jul 2007 12:41:52 GMT
Optimize the core tokenizers/analyzers & deprecate Token.termText

                 Key: LUCENE-969
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
    Affects Versions: 2.3
            Reporter: Michael McCandless
            Assignee: Michael McCandless
            Priority: Minor
             Fix For: 2.3

There is some "low hanging fruit" for optimizing the core tokenizers
and analyzers:

  - Re-use a single Token instance during indexing instead of creating
    a new one for every term.  To do this, I added a new method "Token
    next(Token result)" (Doron's suggestion) which means TokenStream
    may use the "Token result" as the returned Token, but is not
    required to (i.e., it can still return an entirely different Token
    if that is more convenient).  I added default implementations for
    both next() methods in TokenStream so that a subclass can choose
    to implement only one of them (see the consumer-side sketch after
    this list).

  - Use "char[] termBuffer" in Token instead of the "String

    Token now maintains a char[] termBuffer for holding the term's
    text.  Tokenizers & filters should retrieve this buffer and
    directly alter it to put the term text in or change the term

    I only deprecated the termText() method.  I still allow the ctors
    that pass in String termText, as well as setTermText(String), but
    added a NOTE about performance cost of using these methods.  I
    think it's OK to keep these as convenience methods?

    After the next release, when we can remove the deprecated API, we
    should clean up to no longer maintain "either String or
    char[]" (and the initTermBuffer() private method) and always use
    the char[] termBuffer instead.

  - Re-use TokenStream instances across Fields & Documents instead of
    creating a new one for each doc.  To do this I added an optional
    "reusableTokenStream(...)" to Analyzer which just defaults to
    calling tokenStream(...), and then I implemented this for the core
    analyzers (see the analyzer sketch after this list).
I'm using the patch from LUCENE-967 for benchmarking just
tokenization.

The changes above give a 21% speedup (742 seconds -> 585 seconds) for
a LowerCaseTokenizer -> StopFilter -> PorterStemFilter chain, tokenizing
all of Wikipedia, on JDK 1.6 -server -Xmx1024M, Debian Linux, RAID 5
IO system (best of 2 runs).

If I pre-break the Wikipedia docs into 100-token docs then it's 37% faster
(1236 sec -> 774 sec), I think because of re-using TokenStreams across
docs.

I'm just running with this alg and recording the elapsed time:


  {ReadTokens > : *

See this thread for discussion leading up to this:

I also fixed Token.toString() to work correctly when termBuffer is
used (and added unit test).

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

