lucene-dev mailing list archives

From "Michael McCandless" <>
Subject Re: new Token API
Date Mon, 19 Nov 2007 17:01:28 GMT
"Yonik Seeley" <> wrote:
> On Nov 18, 2007 6:07 AM, Michael McCandless <>
> wrote:
> > a quick test tokenizing all of Wikipedia w/
> > SimpleAnalyzer showed 6-8% overall slowdown if I call token.clear() in
> >
> We could slim down clear() a little by only resetting certain things...
> startOffset and endOffset need to be set each time if anyone cares
> about offsets, so they don't really need to be reset.  The only
> tokenizer to use "type" sets it every time AFAIK, so one could argue
> for skipping that as well.  Not sure if the small performance gain
> would be worth it though.

Honestly, I was surprised by how sizable the performance difference was
when clearing each token, and I don't understand why.  I wonder whether
more frequently setting pointers to null somehow causes GC to kick in
more often?  (I was using Sun's JDK 1.5.0_08 on Linux.)  If so, setting
payloadLength=0 (once the payload is inlined) could be faster than
setting payloadBytes=null.
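To make that idea concrete, here is a hedged sketch of a slimmed-down
clear().  The class and field names (SlimToken, termBuffer, payloadBytes,
etc.) are illustrative only, not Lucene's actual Token API; the point is
just which fields get reset and which don't:

```java
// Hypothetical sketch of a pared-back clear(): reset only fields that
// tokenizers do NOT unconditionally overwrite.  Offsets and type are
// skipped, per the reasoning above; the payload is "cleared" by zeroing
// its length rather than nulling the array reference.
class SlimToken {
    char[] termBuffer = new char[16];
    int termLength;
    int startOffset, endOffset;   // offset-aware tokenizers set these every token
    String type = "word";         // the one tokenizer using type sets it every token
    int positionIncrement = 1;
    byte[] payloadBytes;          // inlined payload storage (assumed design)
    int payloadLength;

    /** Reset only what absolutely must be reset. */
    void clear() {
        termLength = 0;
        positionIncrement = 1;
        payloadLength = 0;        // cheaper than payloadBytes = null
    }
}
```

If the GC hypothesis is right, never writing null into payloadBytes should
also avoid extra card-marking / write-barrier work on each token.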

And maybe we should in fact store the payload in a local byte[], rather
than by reference, so we don't keep changing that pointer with every
token.
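A minimal sketch of what that could look like, assuming a token-local
buffer that grows on demand (the PayloadBuffer name and methods are
hypothetical, not Lucene API):

```java
// Sketch: copy payload bytes into a reusable token-local buffer instead
// of storing a reference to the caller's array.  The byte[] pointer then
// stays stable across tokens in the common case.
class PayloadBuffer {
    private byte[] buf = new byte[8];
    private int length;

    void setPayload(byte[] src, int offset, int len) {
        if (len > buf.length) {
            // rare case: grow the local buffer
            buf = new byte[Math.max(len, buf.length * 2)];
        }
        System.arraycopy(src, offset, buf, 0, len);  // reuse same array
        length = len;
    }

    int length() { return length; }
    byte[] bytes() { return buf; }
}
```

The trade-off is an arraycopy per token with a payload, in exchange for a
reference field that never changes.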

Anyway, I do think it's worth paring back to what absolutely must be
cleared.  We could even reset the fields directly from
DocumentsWriter.  I've found that keeping good performance requires
being absurdly vigilant: if we slip a bit here and a bit there, then
suddenly we'll find that we've become slow.
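For the consumer-side idea, here is a rough sketch of what resetting from
the indexing side could look like: the consumer clears exactly the fields
it just recorded, so next() never pays for a full clear().  All names
below (ReusableToken, IndexerSketch, consume) are illustrative, not the
actual DocumentsWriter code:

```java
// Sketch: reset per-token state from the consumer, right after the
// consumed values are recorded, instead of clearing inside the token
// producer on every next() call.
class ReusableToken {
    int positionIncrement = 1;
    int payloadLength;
}

class IndexerSketch {
    int positionsSeen;

    /** Consume one token, then reset only the fields that were read. */
    void consume(ReusableToken t) {
        positionsSeen += t.positionIncrement;  // record the state we need
        t.positionIncrement = 1;               // consumer-side reset
        t.payloadLength = 0;
    }
}
```

This keeps the reset cost in one place, where we control exactly which
fields matter, instead of spread across every TokenStream implementation.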

