lucene-dev mailing list archives

From "Michael McCandless" <>
Subject Re: Token termBuffer issues
Date Tue, 24 Jul 2007 21:45:57 GMT
"Yonik Seeley" <> wrote:
> On 7/24/07, Michael McCandless <> wrote:
> > OK, I ran some benchmarks here.
> >
> > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> > 17.2% speedup using Sun's JDK 6, on Linux.  This is indexing all
> > Wikipedia content using LowerCaseTokenizer + StopFilter +
> > PorterStemFilter.  I think it's worth pursuing!
> Did you try it w/o token reuse (reuse tokens only when mutating, not
> when creating new tokens from the tokenizer)?

I haven't tried this variant yet.  I guess for long filter chains the
GC cost of the tokenizer creating the initial token should shrink as a
fraction of the overall time.  Though I think we should still re-use
the initial token since it should (?) only help.

> It would be interesting to see what's attributable to Token reuse only
> (after core filters have been optimized to use the char[] setters,
> etc).

Good question; it could be the gains are mostly from switching to
char[] termBuffer and less so from Token instance re-use.  Too many
tests to try :)

> We've had issues in the past regarding errors with filters dealing
> with token properties:
>  1) filters creating a new token from an old token, but forgetting
> about setting positionIncrement
>  2) legacy filters losing "new" information such as payloads when
> creating new tokens, because those properties didn't exist when the
> filter was written.
> #1 is solved by token mutation because there are setters for the value
> (before, the filter author was forced to create a new token, unless
> they could access the package-private String).

Ahhh, good!
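
Roughly, here is why mutation sidesteps #1.  This is only a sketch under
stated assumptions: SketchToken and LowerCaseSketch are hypothetical,
simplified stand-ins I made up for illustration, not the real Token
class or any core filter.  The filter touches only the term characters,
so positionIncrement (and any property it has never heard of, such as a
payload) is carried along untouched:

```java
// Hypothetical, simplified stand-in for a token; NOT Lucene's real Token class.
final class SketchToken {
    private char[] termBuffer = new char[16];
    private int termLength = 0;
    private int positionIncrement = 1;
    private byte[] payload = null; // "new" property a legacy filter may not know about

    // Copy the given characters into the reusable buffer, growing it if needed.
    void setTermBuffer(char[] src, int offset, int length) {
        if (termBuffer.length < length) {
            termBuffer = new char[length];
        }
        System.arraycopy(src, offset, termBuffer, 0, length);
        termLength = length;
    }

    char[] termBuffer() { return termBuffer; }
    int termLength() { return termLength; }
    String term() { return new String(termBuffer, 0, termLength); }

    void setPositionIncrement(int inc) { positionIncrement = inc; }
    int getPositionIncrement() { return positionIncrement; }

    void setPayload(byte[] p) { payload = p; }
    byte[] getPayload() { return payload; }
}

// Hypothetical filter step: lower-cases the term in place and returns the
// same token instance, so no other property has to be copied by hand.
final class LowerCaseSketch {
    static SketchToken filter(SketchToken t) {
        char[] buf = t.termBuffer();
        int len = t.termLength();
        for (int i = 0; i < len; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return t;
    }
}
```

A filter written this way cannot forget positionIncrement, because it
never constructs a replacement token in the first place.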

> #2 can now be solved by clone() (another relatively new addition)
> So what new problems might crop up with token reuse?
>  - a filter reusing a token, but not zeroing out something new like
> payloads because they didn't exist when the filter was authored (the
> opposite problem from before)
> Would a Token.clear() be needed for use by (primarily) tokenizers?

Hmm, good point; I like the clear() idea.  I will add that.
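
Something along these lines, perhaps.  Again just a sketch: ReusableToken
and SketchTokenizer are hypothetical names for illustration, and the
field set and defaults are my assumptions, not the final Token API.  The
point is that the tokenizer calls clear() before refilling a reused
token, so state left by the previous use (including properties added
after the tokenizer was written) cannot leak into the next token:

```java
// Hypothetical reusable token; the fields and their defaults are assumptions.
final class ReusableToken {
    char[] termBuffer = new char[16];
    int termLength = 0;
    int positionIncrement = 1;
    byte[] payload = null;

    // Reset every property to its default. When a new property is added to
    // the token, it only needs to be reset here, not in every tokenizer.
    void clear() {
        termLength = 0;
        positionIncrement = 1;
        payload = null;
    }

    // Copy the characters of s into the reusable buffer, growing it if needed.
    void setTerm(String s) {
        if (termBuffer.length < s.length()) {
            termBuffer = new char[s.length()];
        }
        s.getChars(0, s.length(), termBuffer, 0);
        termLength = s.length();
    }

    String term() { return new String(termBuffer, 0, termLength); }
}

// Hypothetical whitespace tokenizer that fills a caller-supplied token.
final class SketchTokenizer {
    private final String[] words;
    private int pos = 0;

    SketchTokenizer(String text) {
        String trimmed = text.trim();
        words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Returns the reused token filled with the next word, or null at end of input.
    ReusableToken next(ReusableToken reuse) {
        if (pos >= words.length) {
            return null;
        }
        reuse.clear(); // wipe whatever a downstream filter left behind
        reuse.setTerm(words[pos++]);
        return reuse;
    }
}
```

Filters that only mutate a token would not need to call clear(); it is
mainly for whoever starts a fresh token in a reused instance.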
