lucene-dev mailing list archives

From "Yonik Seeley"
Subject Token termBuffer issues
Date Thu, 19 Jul 2007 22:23:16 GMT
I had previously missed the changes to Token that add support for
using an array (termBuffer):

+  // For better indexing speed, use termBuffer (and
+  // termBufferOffset/termBufferLength) instead of termText
+  // to save new'ing a String per token
+  char[] termBuffer;
+  int termBufferOffset;
+  int termBufferLength;

While I think a char[] would have been the best approach to start
with, rather than String, I'm concerned that at this point it will do
little more than add overhead, resulting in slower code, not faster.

- If any tokenizer or token filter sets the termBuffer, every
downstream component would need to check both the String and the
char[].  It could be made backward compatible by constructing a
String on demand, but that would really slow things down unless the
whole chain is converted to use only the char[] somehow.

- It doesn't look like the indexing code currently pays any attention
to the char[], right?

- What if both the String and char[] are set?  A filter that doesn't
know better sets the String, which currently doesn't clear the
char[].  Should it?
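
To make the points above concrete, here is a minimal sketch of one
way the two representations could be kept consistent.  It is not the
actual Token class: SketchToken, its setters, and hasTermBuffer() are
hypothetical names invented for illustration, and only the three
field names come from the patch quoted above.  The sketch assumes
each setter clears the other representation, and that the String
accessor builds a String on demand when only the char[] is set.

```java
// Hypothetical sketch; SketchToken is NOT Lucene's Token class.
final class SketchToken {
    private String termText;       // legacy String representation
    private char[] termBuffer;     // array representation from the patch
    private int termBufferOffset;
    private int termBufferLength;

    // Assumption: setting the String clears the char[], so a filter that
    // only knows about Strings cannot leave a stale buffer behind.
    void setTermText(String text) {
        termText = text;
        termBuffer = null;
        termBufferOffset = 0;
        termBufferLength = 0;
    }

    // Assumption: setting the char[] clears the String for the same reason.
    void setTermBuffer(char[] buffer, int offset, int length) {
        termBuffer = buffer;
        termBufferOffset = offset;
        termBufferLength = length;
        termText = null;
    }

    // Backward-compatible accessor. When only the char[] is set, this
    // allocates a new String on every call, which is exactly the cost
    // the char[] was meant to avoid.
    String termText() {
        if (termText != null) {
            return termText;
        }
        if (termBuffer != null) {
            return new String(termBuffer, termBufferOffset, termBufferLength);
        }
        return null;
    }

    boolean hasTermBuffer() {
        return termBuffer != null;
    }
}
```

With this arrangement a String-only filter placed downstream of a
char[]-producing tokenizer still works, but it pays for a new String
per token, so the speedup would only show up if every component in
the chain reads the char[] directly.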


