lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: Token implementation
Date Tue, 20 May 2008 09:09:10 GMT

Hiroaki Kawai wrote:
> "Michael McCandless" <> wrote:
>> I agree the situation is not ideal, and it's confusing.
>> This comes back to LUCENE-969.
>> At the time, we decided to keep both String & char[] only to avoid
>> performance cost for those analyzer chains that use String tokens
>> exclusively.
>> The idea was to allow Token to keep both text or char[] and sometimes
>> both (if they are storing the same characters, as happens if
>> termBuffer() is called when it's a String being stored)
>> Then, in 3.0, we would make the change you are proposing (to only
>> store char[] internally).  That was the plan, anyway.  Accelerating
>> this plan (to store only char[] today) is compelling, but I worry
>> about the performance hit to legacy analyzer chains...
> I'd like to suggest another implementation which use
> StringBuilder or CharBuffer instead of char[].

StringBuilder has to wait until we are on Java 1.5.

> Because we don't need to maintain the length separatly from the
> characater sequence itself.
> If we use char[], then we have to handle char[] and the offset and the
> sequence length, the method we implement will be so complex.
> I think those should be packed into one object.
> I did not test that using StringBuilder or CharBuffer hit the
> performance or not. But I think it might not result in so bad  
> performace.

I'm somewhat less optimistic here.  These classes are targeting use  
cases with much larger sequences of characters than a typical Token  
in a Document.  We should test the performance impact to see.

>> More responses below:
>> DM Smith <> wrote:
> -snip-
>>> I was looking at this in light of TokenFilter's next(Token)  
>>> method and how
>>> it was being used. In looking at the contrib filters, they have  
>>> not been
>>> modified. Further, most of them, if they work with the content  
>>> analysis and
>>> generation, do their work in strings. Some of these appear to be  
>>> good
>>> candidates for using char[] rather than strings, such as the  
>>> NGram filter.
>>> But others look like they'd just as well remain with String  
>>> manipulation.
>> It would be great to upgrade all contrib filters to use the re-use  
>> APIs.
> I'll contribute, too. :-)


>>> I'd like to suggest that internally, that Token be changed to  
>>> only use
>>> char[] termBuffer and eliminate termText.
>> The question is what performance cost we are incurring eg on the
>> contrib (& other) sources/filters?  Every time setTermText is called,
>> we copy out the chars (instead of holding a reference to the String).
>> Every time getText() is called we create a new String(...) from the
>> char[].  I think it's potentially a high cost, and so maybe we should
>> wait until 3.0 when we drop the deprecated APIs?
>>> And also, that termText be restored as not deprecated.
>> It made me nervous keeping this method because it looks like it  
>> should
>> be cheap to call, and in the past it was very cheap to call.  But,
>> maybe we could keep it, if we mark very very clearly in the javadocs
>> the performance cost you incur by using this method (it makes a new
>> String() every time)?
> I'd like to suggest changing the method definition to:
>  public void setTermText(CharSequence text)

This seems like a good idea.

>>> But, in TokenFilter, next() should be deprecated, IMHO.
>> I think this is a good idea.  After all if people don't want to  
>> bother
>> using the passed in Token, they are still allowed to return a new
>> one.
> I could not see what you meant. Can I ask you to let me know the  
> reason
> why it should be deprecated?

Deprecated in favor of next(Token result) API.  Ie, token sources/ 
filters should migrate to this re-use API.  It's a straightforward  
migration because the method next(Token result) is allowed to ignore  
result (and return its own Token) if it wants to.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message