lucene-dev mailing list archives

From Chris Hostetter
Subject Re: Token termBuffer issues
Date Fri, 20 Jul 2007 23:28:31 GMT

: > Currently the char[] wins, but good point: seems like each setter
: > should null out the other one?
: Certainly the String setter should null the char[] (that's the only
: way to keep back compatibility), and probably vice-versa.

i haven't really thought about this before today (i missed seeing the
char[] stuff get added to Token as well) but if we're confident the char[]
stuff is the direction we want to go, then i believe the cleanest forward
migration plan is...

1) deprecate Token.termText, Token.getTermText(), Token.setTermText()
2) make Token.setTermBuffer() null out Token.termText (document)
3) make Token.setTermText() null out Token.termBuffer
4) refactor all of the "if (null == termBuffer)" logic in
DocumentsWriter into the Token class, a la...
  public final char[] termBuffer() {
    initTermBuffer();
    return termBuffer;
  }
  public final int termBufferOffset() {
    initTermBuffer();
    return termBufferOffset;
  }
  public final int termBufferLength() {
    initTermBuffer();
    return termBufferLength;
  }
  // lazily build the char[] from a legacy termText the first time it is needed
  private void initTermBuffer() {
    if (null != termBuffer) return;
    termBufferLength = termText.length();
    termBufferOffset = 0;
    termBuffer = new char[termBufferLength];
    termText.getChars(0, termBufferLength, termBuffer, 0);
  }
...such that DocumentsWriter never uses termText, just termBuffer
5) at some point down the road, modify all of the "core" TokenStreams to
use termBuffer instead of termText
6) at some point way down the road, delete the deprecated
methods/variables and the Token.initTermBuffer method.
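
To make steps 2 through 4 concrete, here is a minimal, self-contained sketch. SketchToken is a hypothetical stand-in, not the real org.apache.lucene.analysis.Token, and the setTermBuffer(char[], int, int) signature is an assumption based on the offset/length accessors above. The null guard on termText is also an addition for the case where neither field has been set.

```java
// Hypothetical stand-in for Token, for illustration only.
class SketchToken {
    private String termText;       // legacy representation (step 1 would deprecate it)
    private char[] termBuffer;     // new representation
    private int termBufferOffset;
    private int termBufferLength;

    /** Step 3: the legacy setter nulls out the buffer so the two cannot disagree. */
    public void setTermText(String text) {
        termText = text;
        termBuffer = null;
    }

    /** Step 2: the buffer setter nulls out the legacy String (assumed signature). */
    public void setTermBuffer(char[] buffer, int offset, int length) {
        termBuffer = buffer;
        termBufferOffset = offset;
        termBufferLength = length;
        termText = null;
    }

    /** Step 4: every buffer accessor goes through the lazy conversion. */
    public final char[] termBuffer() {
        initTermBuffer();
        return termBuffer;
    }

    public final int termBufferOffset() {
        initTermBuffer();
        return termBufferOffset;
    }

    public final int termBufferLength() {
        initTermBuffer();
        return termBufferLength;
    }

    private void initTermBuffer() {
        // Nothing to do if the buffer exists, or if no term was ever set.
        if (termBuffer != null || termText == null) return;
        termBufferLength = termText.length();
        termBufferOffset = 0;
        termBuffer = new char[termBufferLength];
        termText.getChars(0, termBufferLength, termBuffer, 0);
    }

    public static void main(String[] args) {
        SketchToken t = new SketchToken();

        // Old-style producer: only knows about termText.
        t.setTermText("hello");
        System.out.println(new String(t.termBuffer(), t.termBufferOffset(), t.termBufferLength()));

        // New-style producer: hands over a slice of a larger array.
        t.setTermBuffer("xworldx".toCharArray(), 1, 5);
        System.out.println(new String(t.termBuffer(), t.termBufferOffset(), t.termBufferLength()));
    }
}
```

A consumer written like the main method above only ever touches the three termBuffer accessors, which is the property step 4 is after for DocumentsWriter.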

Unless I've missed something, the end result should be...

a) existing TokenStreams that use termText exclusively and don't know
anything about termBuffer will have the exact same performance
characteristics that they currently do (a char[] will be created on demand
the first time termBuffer is used -- by DocumentsWriter)

b) TokenStreams which wind up being a mix of old and new code using both
termText and termBuffer should work correctly in any combination.

c) new TokenStreams that use termBuffer exclusively should work fine, and
have decent performance even with the overhead of the initTermBuffer()
call (which will get better once the deprecated legacy termText usage can
be removed).
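
For (b), one detail the steps above leave implicit: a legacy filter that reads termText after a newer producer has set only the buffer would see null unless the deprecated getter also converts on demand. The sketch below assumes that behavior. MixedToken, the two toy filters, and the two-argument setTermBuffer are all hypothetical names for illustration, not Lucene API.

```java
// Hypothetical token with conversion in both directions, for illustration only.
class MixedToken {
    private String termText;
    private char[] termBuffer;
    private int termBufferLength;

    void setTermText(String text) { termText = text; termBuffer = null; }

    void setTermBuffer(char[] buffer, int length) {
        termBuffer = buffer;
        termBufferLength = length;
        termText = null;
    }

    /** Assumed behavior: the char[] wins whenever it exists, so in-place edits are never missed. */
    String termText() {
        if (termBuffer != null) return new String(termBuffer, 0, termBufferLength);
        return termText;
    }

    char[] termBuffer() { initTermBuffer(); return termBuffer; }

    int termBufferLength() { initTermBuffer(); return termBufferLength; }

    private void initTermBuffer() {
        if (termBuffer != null || termText == null) return;
        termBufferLength = termText.length();
        termBuffer = termText.toCharArray();
    }
}

class MixedChainDemo {
    /** Old-style filter: String in, String out. */
    static void legacyStripS(MixedToken t) {
        String s = t.termText();
        if (s.endsWith("s")) t.setTermText(s.substring(0, s.length() - 1));
    }

    /** New-style filter: edits the char[] in place. */
    static void bufferLowerCase(MixedToken t) {
        char[] buf = t.termBuffer();
        int len = t.termBufferLength();
        for (int i = 0; i < len; i++) buf[i] = Character.toLowerCase(buf[i]);
    }

    /** Consumer in the style of DocumentsWriter: reads the buffer only. */
    static String consume(MixedToken t) {
        return new String(t.termBuffer(), 0, t.termBufferLength());
    }

    public static void main(String[] args) {
        MixedToken a = new MixedToken();
        a.setTermBuffer("Tokens".toCharArray(), 6);   // new producer
        legacyStripS(a);                              // then old filter
        bufferLowerCase(a);                           // then new filter
        System.out.println(consume(a));

        MixedToken b = new MixedToken();
        b.setTermText("Tokens");                      // old producer
        bufferLowerCase(b);                           // then new filter
        legacyStripS(b);                              // then old filter
        System.out.println(consume(b));
    }
}
```

Both orderings hand the consumer the same term. Note that termText() here does not cache the String it builds, so a later in-place edit of the buffer cannot leave a stale copy behind; that is a design choice of this sketch, not something the plan above prescribes.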

Side note: Token.toString() is currently jacked in cases where termBuffer is

