lucene-dev mailing list archives

From DM Smith
Subject Token implementation
Date Sun, 18 May 2008 22:04:20 GMT
I think the Token implementation as it stands could use some improvement
and I'd be willing to do it. I'd like some input, though, especially
because it is core to Lucene.

I've been working on eliminating deprecations from my user code and I
found that Token.termText() is deprecated.

This is not about my code, but the code in Token.

Token allows one of two representations of the actual token text:
either an immutable String or a mutable char[]. One can flip back and
forth between these all too easily.

termText() is deprecated so termBuffer() is suggested as a replacement.

Calling termBuffer() will potentially copy the text out of the String  
termText and into the newly created buffer and return it.

Calling setTermText(str), which is not deprecated, will drop the  
buffer and save the str in termText.
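
To make the flipping concrete, here is a minimal sketch of the behavior described above. DualToken, hasString(), hasBuffer() and MIN_BUFFER_SIZE are hypothetical names for illustration; this is a simplified model, not the actual Token source, and clearing the String after the copy is an assumption of the sketch.

```java
// Simplified model of a token that holds its text either as a String
// or as a char[]. Hypothetical stand-in, not Lucene's Token source.
final class DualToken {
    private static final int MIN_BUFFER_SIZE = 10; // arbitrary for this sketch

    private String termText;   // immutable representation
    private char[] termBuffer; // mutable representation
    private int termLength;    // number of valid chars in termBuffer

    // Saves the String and drops the buffer.
    void setTermText(String text) {
        termText = text;
        termBuffer = null;
        termLength = 0;
    }

    // Copies the String (if any) into a newly created buffer and returns it.
    char[] termBuffer() {
        if (termBuffer == null) {
            int length = (termText == null) ? 0 : termText.length();
            termBuffer = new char[Math.max(length, MIN_BUFFER_SIZE)];
            if (termText != null) {
                termText.getChars(0, length, termBuffer, 0);
            }
            termLength = length;
            termText = null; // assumption: the String is cleared after the copy
        }
        return termBuffer;
    }

    // Length of the text in whichever representation currently holds it.
    int termLength() {
        if (termBuffer != null) {
            return termLength;
        }
        return (termText == null) ? 0 : termText.length();
    }

    boolean hasString() { return termText != null; }
    boolean hasBuffer() { return termBuffer != null; }
}
```

Calling setTermText("lucene"), then termBuffer(), then setTermText("token") switches the representation each time and allocates a fresh buffer on the next termBuffer() call.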

It appears that the intended invariant is that either termText or
termBuffer holds the content, but not both.
However, termText() potentially violates this by loading termText
from the termBuffer without nulling out the buffer.

Also, in my code, I am not manipulating char[], so after getting the
buffer I need to immediately convert it to a String to process it. And
when I'm done, I have a String, possibly of some other length. Stuffing
it back into the termBuffer requires a call to:
setTermBuffer(s.toCharArray(), 0, s.length())
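
As a rough illustration of that round trip, here is a self-contained sketch. BufferToken is a hypothetical stand-in that only mimics the termBuffer(), termLength() and setTermBuffer(char[], int, int) shape, and stripPossessive is an invented example of String processing that changes the length.

```java
// Hypothetical stand-in for a token that stores its text in a char[].
final class BufferToken {
    private char[] termBuffer = new char[0];
    private int termLength;

    // Copies length chars starting at offset into the internal buffer.
    void setTermBuffer(char[] buffer, int offset, int length) {
        if (termBuffer.length < length) {
            termBuffer = new char[length];
        }
        System.arraycopy(buffer, offset, termBuffer, 0, length);
        termLength = length;
    }

    char[] termBuffer() { return termBuffer; }
    int termLength() { return termLength; }
}

final class RoundTrip {
    // char[] -> String -> process -> char[]: two extra copies per token.
    static void stripPossessive(BufferToken token) {
        String s = new String(token.termBuffer(), 0, token.termLength());
        if (s.endsWith("'s")) {
            s = s.substring(0, s.length() - 2); // result is shorter than the input
        }
        token.setTermBuffer(s.toCharArray(), 0, s.length());
    }
}
```

Both new String(...) and toCharArray() copy the characters, which is the overhead a String-based filter pays when the token only exposes a buffer.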

I was looking at this in light of TokenFilter's next(Token) method and
how it is being used. The contrib filters have not been modified to use
it. Further, most of them, if they work with the content analysis and
generation, do their work in Strings. Some of these appear to be good
candidates for using char[] rather than Strings, such as the NGram
filter. But others look like they'd just as well remain with String
manipulation.

I'd like to suggest that, internally, Token be changed to use only
char[] termBuffer and that the termText field be eliminated. Also, that
termText() be restored as not deprecated.
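
A minimal sketch of what I have in mind, assuming a single char[] field with the String methods kept as conveniences on top of it. CharOnlyToken is a hypothetical name, and the growth policy is deliberately trivial; this is not a proposed patch.

```java
// Hypothetical sketch: one char[] representation, String access layered on top.
final class CharOnlyToken {
    private char[] termBuffer = new char[16]; // initial size is arbitrary
    private int termLength;

    // Grows the buffer only when the new content does not fit.
    private void ensureCapacity(int length) {
        if (termBuffer.length < length) {
            termBuffer = new char[length];
        }
    }

    void setTermBuffer(char[] buffer, int offset, int length) {
        ensureCapacity(length);
        System.arraycopy(buffer, offset, termBuffer, 0, length);
        termLength = length;
    }

    // String setter writes into the existing buffer instead of replacing it.
    void setTermText(String text) {
        int length = text.length();
        ensureCapacity(length);
        text.getChars(0, length, termBuffer, 0);
        termLength = length;
    }

    // String getter builds a String on demand; nothing is cached.
    String termText() {
        return new String(termBuffer, 0, termLength);
    }

    char[] termBuffer() { return termBuffer; }
    int termLength() { return termLength; }
}
```

With this shape there is only one place the content can live, so the either/or invariant cannot be broken, and a reused token keeps its buffer across setTermText calls as long as the new text fits.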

But, in TokenFilter, next() should be deprecated, IMHO.

The performance advantage is in reusing Tokens and their buffer.

I have also used a better algorithm than doubling for resizing an  
array. I'd have to hunt for it.

-- DM Smith, infrequent contributor, grateful user!
