lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless" <>
Subject Re: Token termBuffer issues
Date Sat, 21 Jul 2007 13:12:16 GMT

"Chris Hostetter" <> wrote:
> : > Currently the char[] wins, but good point: seems like each setter
> : > should null out the other one?
> :
> : Certainly the String setter should null the char[] (that's the only
> : way to keep back compatibility), and probably vice-versa.
> i haven't really thought baout this before today (i missed seeing the
> char[] stuff get added to Token as well) but if we're confident the
> char[]
> stuff is hte direction we want to go, then i believe the cleanest forward
> migration plan is...
> 1) deprecate Token.termText, Token.getTermText(), Token.setTermText
> 2) make Token.setTermBuffer() null out Token.termText (document)
> 3) make Token.setTermText() null out Token.termBuffer
> 4) refactor all of the the "if (null == termBuffer)" logic in
> DocumentsWriter into a the Token class, ala...
>   public final char[] termBuffer() {
>     initTermBuffer();
>     return termBuffer;
>   }
>   public final int termBufferOffset() {
>     initTermBuffer();
>     return termBufferOffset;
>   }
>   public final int termBufferLength() {
>     initTermBuffer();
>     return termBufferLength;
>   }
>   private void initTermBuffer() {
>     if (null != termBuffer) return;
>     termBufferOffset=0;
>     termBufferLength = termText.length();
>     termBuffer = char[termBufferLength];
>     termText.getChars(0, termBufferLength, termBuffer, 0)
>   }
> ...such that DocumentsWRiter never uses termText just termBuffer
> 5) at some point down the road, modify all of the "core" TokenStreams to
> use termBuffer instead of termText
> 6) at some point way down the road, delete the depreacated
> methods/variables and the Token.initTermBuffer method.
> Unless I've missed something, the end result should be...
> a) existing TokenStreams that use termText exclusively and don't know
> anything about termBuffer will have the exact same performance
> characteristics that they currently do (a char[] will be created on
> demand
> the first time termBuffer is used -- by DocumentsWriter)
> b) TokenStreams which wind up being a mix of old and new code using both
> termText and termBuffer should work correctly in any combination.
> c) new TokenStreams that use termBuffer exclusively should work fine, and
> have decent performance even with the overhead of the initTermBuffer()
> call (which will get better once the deprecated legacy termText usage can
> be removed.

I like this approach Hoss!

I will open an issue and work on it; I'd like to finish up through
your step 5 below.  This way "out of the box" performance of Lucene is
improved, for people who use the core analyzers and filters.

To further improve "out of the box" performance I would really also
like to fix the core analyzers, when possible, to re-use a single
Token instance for each Token they produce.  This would then mean no
objects are created as you step through Tokens in the TokenStream
... so this would give the best performance.

The problem is, this would be a sneaky API change and would for
example break anyone who gathers the Tokens into an array and
processes them later (eg maybe highlighter does?).

Maybe one way to migrate to this would be to add a new method "Token
nextDirect()" to TokenStream which is allowed to return a Token that
may be recycled.  This means callers of nextDirect() must make their
own copy of the Token if they intend to keep it for later usage.  It
would default to "return next()" and I would then default "next()" to
call nextDirect() but make a full copy of what's
returned. DocumentsWriter would use this to step through the tokens.

Analyzers would then implement either next() or nextDirect().  We
would fix all core analyzers to implemente nextDirect(), and then all
other analyzers would remain backwards compatible.

>From a caller's standpoint the only difference between nextDirect()
and next() is that next() guarantees to give you a "full private copy"
of the Token but nextDirect() does not.

We could also punt on this entirely since it is always possible for
people to make their own analyzers that re-use Tokens, but I think
having decent "out of the box" performance with Lucene is important.
Ie, Lucene's defaults should be set up so that you get decent
performance without changing anything; you should not have to work
hard to get good performance and unfortunately today you still do....

> Side note: Token.toString() is current jacked in cases where termBuffer is
> used)

Woops -- sorry -- I will add test case & fix it.

> Note that there are many existing filters that directly access and
> manipulate the package protected String termText.  These will need to
> be changed.

Hmmm ... is it too late to make all of Token's package protected attrs
private, so we require use of getters?  Or, maybe just make the new
ones (termBuffer*) private and then leave termText package protected
but deprecate it and add a comment saying that core analyzers and
filters have switched to using termBuffer and so you may get an NPE if
you access termText directly?  Plus fix all core analyzers to not
directly access termText and use termBuffer instead...


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message