lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From DM Smith <>
Subject Re: Token implementation
Date Fri, 11 Jul 2008 12:32:28 GMT
I am now looking at this in depth.
Here is what I am finding: Token and Term are tightly paired, but  
their implementation is not.

Term holds a field and a word. Both of these are Strings. So to get  
the term out of a Token and put it into a Term, one of two constructs  
is used:
t = new Term(field, token.termText());
String text = new String(token.termBuffer(), 0, token.termLength());
t = new Term(field, text);

Both of these are awkward. Shouldn't Term have constructors that take  
a Token? Because of reuse, these ctors will have to copy out the term  
text into it's own field.

-- DM

On May 20, 2008, at 5:21 AM, DM Smith wrote:

> On May 20, 2008, at 5:01 AM, Michael McCandless wrote:
>> DM Smith wrote:
>>> On May 19, 2008, at 4:33 PM, Michael McCandless wrote:
>>>> DM Smith <> wrote:
>>>>> Michael McCandless wrote:
>>>>>> I agree the situation is not ideal, and it's confusing.
>>>>> My problem as a user is that I have to read the code to figure  
>>>>> out how to
>>>>> optimally use the class. The JavaDoc is a bit wanting.
>>>> Yeah we should fix the javadocs.
>>>>>> This comes back to LUCENE-969.
>>>>>> At the time, we decided to keep both String & char[] only to
>>>>>> avoid
>>>>>> performance cost for those analyzer chains that use String tokens
>>>>>> exclusively.
>>>>>> The idea was to allow Token to keep both text or char[] and  
>>>>>> sometimes
>>>>>> both (if they are storing the same characters, as happens if
>>>>>> termBuffer() is called when it's a String being stored)
>>>>> When termBuffer() is called termText is always null on return.  
>>>>> This is the
>>>>> invariant of initTermBuffer() which is called directly or  
>>>>> indirectly by
>>>>> every method that touches termBuffer.
>>>> Sorry I meant termText().
>>>>> It is only after calling termText() that one could have both.  
>>>>> The only
>>>>> advantage I see here is that calling it twice without any  
>>>>> intervening call
>>>>> to a termBuffer method would it return the same physical string.
>>>> Right.
>>>>> After calling setTermText(newText), termBuffer is set to null.
>>>>> I presume the purpose of a filter is to get the token content,  
>>>>> and if
>>>>> modified, set the new content. If so, the result of the setXXX  
>>>>> will be that
>>>>> either termText or termBuffer will be null.
>>>> Right, though, if there's no change, and the next filter in the  
>>>> chain
>>>> calls termText(), we save constructing a new String by caching it.
>>>>>> Then, in 3.0, we would make the change you are proposing (to only
>>>>>> store char[] internally).  That was the plan, anyway.   
>>>>>> Accelerating
>>>>>> this plan (to store only char[] today) is compelling, but I worry
>>>>>> about the performance hit to legacy analyzer chains...
>>>>> I wonder whether it is all that significant an issue. Today,  
>>>>> some of the
>>>>> Lucene analyzers have been migrated to using char[], while  
>>>>> others, notably
>>>>> contrib, continue to use text.
>>>>> IMHO: Prior to char[], the text was replaceable, but not  
>>>>> modifiable. From a
>>>>> practical perspective, Token reuse minimized the cost of  
>>>>> construction, but
>>>>> not much else. The performance of a Token was predictable, but  
>>>>> the filter
>>>>> was potentially sub-optimal. With char[] and supporting methods,  
>>>>> the text
>>>>> became modifiable, too.
>>>>> When a filter calls setTermText or setTermBuffer, it does not  
>>>>> know how the
>>>>> consumer of the token will work. It could be that it stores it  
>>>>> with
>>>>> setTermText and the next filter calls termBuffer().
>>>>> I may not understand this correctly, but it seems to me that the  
>>>>> following
>>>>> is plausible given a filter chain of Lucene provided filters  
>>>>> (including contrib)
>>>>> If we have a token filter chain of A -> B -> C, which uses  
>>>>> next() in any
>>>>> part of the chain, the flow of a reusable Token is stopped. A  
>>>>> given filter
>>>>> may cache a Token and reuse it. So consider the following  
>>>>> scenario:
>>>>> A overrides next(Token) and reuses the token via char[]
>>>>> B overrides next() and has a cached Token and updates text.
>>>>> C overrides next(Token) and reuses the token via char[].
>>>>> First run:
>>>>> After A is done, the termText in the supplied Token is null and  
>>>>> termBuffer
>>>>> has a value.
>>>>> B provides it's own token so it is not influenced by A.
>>>>> C is given the token that B returns, because the caller is  
>>>>> effectively using
>>>>> "token =", but because the token has text, a  
>>>>> performance
>>>>> hit is taken to put it into char[]. Both text and char[] start  
>>>>> out the same,
>>>>> but because char[] is changed, termText is set to null.
>>>>> Second run:
>>>>> A starts with a Token with a char[] because it is reusing the  
>>>>> token from the
>>>>> last run or because it is using a localToken from the caller. If  
>>>>> it is a
>>>>> localToken, then the scenario is as above. But if it is the end  
>>>>> result of
>>>>> the first run, then A is re-using the token that is cached in B.  
>>>>> Since C
>>>>> last modified it, it is char[].
>>>>> B uses its cached Token, but it was modified by A to be char[]  
>>>>> with null
>>>>> text. Now B takes a performance hit as it creates a new String.
>>>>> C is as in the first run.
>>>>> Another scenario:
>>>>> A, B and C are all legacy. This would only be filter chains that  
>>>>> are not
>>>>> provided by core Lucene as the core filter chains have been  
>>>>> migrated. This
>>>>> would be a performance hit.
>>>> Why is this a performance hit?  If they are all legacy they would  
>>>> all
>>>> implemented the non-reuse next() API, and would use setTermText,  
>>>> and
>>>> no conversion to char[] would happen (except inside  
>>>> DocumentsWriter)?
>>> I was thinking faster than I was typing. I meant to say "not be a  
>>> performance hit".
>> Ahh, ok.
>>>>> A last scenario:
>>>>> A, B and C are all char[]. This would not take a performance hit.
>>>>> It seems to me that in a mixed chain, that there will always be a
>>>>> performance hit.
>>>> Right, though since we cache the String we should save on multiple
>>>> calls to termText().  And in a non-mixed chain (all new or all  
>>>> old) I
>>>> think there wouldn't be a hit.
>>>>> But my guess is that maintaining termBuffer once used would
>>>>> be a good thing.
>>>> Interesting...
>>>>> So, a modified suggestion to maintain the performance but  
>>>>> improve the first
>>>>> scenario. Do you see any problem with the following?
>>>>> If termBuffer is used in a token, then it is maintained and  
>>>>> never set to  null.
>>>>> Note also that resizeTermBuffer(size) maintains the termBuffer.  
>>>>> That is, it
>>>>> copies the text when the array is grown. When it is known that  
>>>>> it is going
>>>>> to be slammed this is unnecessary. One can implement a helper  
>>>>> function that
>>>>> merely grows the array. There are a couple of places this
>>>>> Thus setTokenText would be something like:
>>>>> public void setTermText(String text) {
>>>>> termText = text;
>>>>> if (termBuffer != null) {
>>>>>  growTermBuffer(termText.length());
>>>>>  termText.getChars(0, termText.length(), termBuffer, 0);
>>>>> }
>>>>> }
>>>>> A possible implementation to grow the array would be:
>>>>> private void growTermBuffer(int newSize)
>>>>> {
>>>>> // determine the best size
>>>>> if (newSize < MIN_BUFFER_SIZE) {
>>>>> newSize = MIN_BUFFER_SIZE;
>>>>> }
>>>>> // if the buffer exists and is too small, then determine a  
>>>>> better size.
>>>>> // this is the current doubling algorithm. it could be better.
>>>>> int tbLength = termBuffer == null ? 0 : termBuffer.length;
>>>>> if (tbLength > 0 && newSize > tbLength) {
>>>>> int size = tbLength;
>>>>> while (size < newSize) {
>>>>>   size *= 2;
>>>>> }
>>>>> newSize = size;
>>>>> }
>>>>> // Check to see if the buffer needs to be resized
>>>>> if (newSize > tbLength)
>>>>> {
>>>>> termBuffer = new char[newSize];
>>>>> }
>>>>> }
>>>> That looks good.  Though, if Token is 100% re-used (full chain is
>>>> "new") then I would expect growing the char[] to be very low cost
>>>> (happens very few times).
>>> Without peeking at your Python sample below ;) the doubling  
>>> algorithm is costly as it over allocates too much. But it has the  
>>> nice behavior that reallocations are fewer.
>>> But, now peeking ahead, a less aggressive allocation policy would  
>>> cause more allocations and the array copy becomes more important.  
>>> Though not significantly. But in a mixed environment with some  
>>> sharing, it could be magnified.
>>> I didn't dig into it, but in the core, are reusable Tokens per  
>>> document or per DocumentWriter. If it is the former then the copy  
>>> becomes more significant. If it is the latter, then it is almost  
>>> moot, as you point out.
>> It's actually per DocumentsWriter per thread per field.
> This is excellent.
>>>>> More below....
>>>>>> More responses below:
>>>>>> DM Smith <> wrote:
>>>>>>> I think the Token implementation as it stands can use some  
>>>>>>> improvement and
>>>>>>> I'd be willing to do it. I'd like some input, though.  
>>>>>>> Especially because it
>>>>>>> is core to Lucene.
>>>>>>> I've been working on eliminating deprecations from my user  
>>>>>>> code and I ran
>>>>>>> across Token.getText() as being deprecated.
>>>>>>> This is not about my code, but the code in Token.
>>>>>>> In Token, it allows one of two representations of the actual
>>>>>>> token, either
>>>>>>> an immutable String, or a mutable char[]. One can flip back 

>>>>>>> and forth
>>>>>>> between these all to easily.
>>>>>>> termText() is deprecated so termBuffer() is suggested as a  
>>>>>>> replacement.
>>>>>>> Calling termBuffer() will potentially copy the text out of the
>>>>>>> String
>>>>>>> termText and into the newly created buffer and return it.
>>>>>>> Calling setTermText(str), which is not deprecated, will drop
>>>>>>> the buffer and
>>>>>>> save the str in termText.
>>>>>>> It appears that the invariant that is trying to be established
>>>>>>> is
>>>>>>> either termText or termBuffer holds the content, but not both.
>>>>>>> However, termBuffer() potentially violates this by loading  
>>>>>>> termText
>>>>>>> with the termBuffer, but not nulling out the buffer.
>>>>>> Actually, both are allowed to be set, as long as they are the  
>>>>>> same.
>>>>>> So termBuffer() is allowed to leave both non-null.
>>>>> I'm not sure of the advantage.
>>>> Covered above (mixed chains).
>>>>>>> Also, in my code, I am not manipulating char[] so getting the
>>>>>>> buffer, I need
>>>>>>> to immediately convert it to a string to process it. And then
>>>>>>> when I'm done,
>>>>>>> I have a String possibly of some other length. To stuff it  
>>>>>>> back into the
>>>>>>> termBuffer, requires a call to:
>>>>>>> setTermBuffer(s.toCharArray(), o, s.length())
>>>>>> It would be better to call Token.resizeTermBuffer(...), then
>>>>>> s.getChars into the Token's term buffer (saves a buffer copy).
>>>>> Many thanks. I saw the comment, but missed getChars. This is  
>>>>> better, but
>>>>> even better would be growTermBuffer (above) since resizeTermBuffer
>>>>> potentially has an unnecessary copy.
>>>> True, but remember growing should be very rare in 100% reuse  
>>>> (new) case.
>>> I'm presuming then that the DocumentWriter is the holder of the  
>>> one reusable Token.
>>>>>>> I was looking at this in light of TokenFilter's next(Token) 

>>>>>>> method and how
>>>>>>> it was being used. In looking at the contrib filters, they  
>>>>>>> have not been
>>>>>>> modified. Further, most of them, if they work with the content
>>>>>>> analysis and
>>>>>>> generation, do their work in strings. Some of these appear to
>>>>>>> be good
>>>>>>> candidates for using char[] rather than strings, such as the
>>>>>>> NGram filter.
>>>>>>> But others look like they'd just as well remain with String 

>>>>>>> manipulation.
>>>>>> It would be great to upgrade all contrib filters to use the re- 
>>>>>> use APIs.
>>>>> I'll be happy to work toward that end. I think it affects my  
>>>>> performance
>>>>> today.
>>>> That would be great :)
>>>>>>> I'd like to suggest that internally, that Token be changed to
>>>>>>> only use
>>>>>>> char[] termBuffer and eliminate termText.
>>>>>> The question is what performance cost we are incurring eg on the
>>>>>> contrib (& other) sources/filters?  Every time setTermText is
>>>>>> called,
>>>>>> we copy out the chars (instead of holding a reference to the  
>>>>>> String).
>>>>>> Every time getText() is called we create a new String(...) from 

>>>>>> the
>>>>>> char[].  I think it's potentially a high cost, and so maybe we  
>>>>>> should
>>>>>> wait until 3.0 when we drop the deprecated APIs?
>>>>> See above. I'll concede that. But I think that once termBuffer  
>>>>> is used
>>>>> because of a mixed chain, it should be retained.
>>>> This may cause problems if a core Tokenizer is used to produce the
>>>> tokens, but then a big chain of old (non-reuse) filters is used  
>>>> after
>>>> that?
>>> If the reusable token is per DocumentWriter and the chain is per  
>>> Document, then it is an advantage to retain it. he buffer is there  
>>> for the next Document, likely big enough that it does not need to  
>>> be allocated again.
>> Well, remember that a single non-reuse filter in the chain "breaks"  
>> the sharing, going backwards.  IE the new token source will see a  
>> different Token with every next() call.  So the buffer doesn't get  
>> reused.  So I think in this case we would spend alot of extra time  
>> converting the Strings in the non-reuse filters back to char[].
> IIRC, I saw somewhere, in an outer driving loop something like the  
> following:
> result =
> If my memory is correct, then reuse is re-established each time  
> through the loop. That localToken is presumably using char[] to set  
> initial content. Or at least once all core and contrib is converted  
> to work with it.
>>>>>>> And also, that termText be restored as not deprecated.
>>>>>> It made me nervous keeping this method because it looks like it 

>>>>>> should
>>>>>> be cheap to call, and in the past it was very cheap to call.   
>>>>>> But,
>>>>>> maybe we could keep it, if we mark very very clearly in the  
>>>>>> javadocs
>>>>>> the performance cost you incur by using this method (it makes a 

>>>>>> new
>>>>>> String() every time)?
>>>>>>> But, in TokenFilter, next() should be deprecated, IMHO.
>>>>>> I think this is a good idea.  After all if people don't want to 

>>>>>> bother
>>>>>> using the passed in Token, they are still allowed to return a new
>>>>>> one.
>>>>> Should the constructors in Token that take a String be  
>>>>> deprecated, too? The
>>>>> comments seem to suggest that.
>>>> Actually I think they should.  We should strongly encourage the  
>>>> re-use
>>>> APIs.
>>>>>>> I have also used a better algorithm than doubling for resizing
>>>>>>> an
>>>>>>> array. I'd have to hunt for it.
>>>>>> That would be great!
>>>>> I'm looking, but still haven't found it. :)
>>>>> I'm looking into digging into this one.
>>>> Python has an interesting approach, for its list type.  Here's a
>>>> snippet from listobject.c from Python's sources:
>>>> 	/* This over-allocates proportional to the list size, making room
>>>> 	 * for additional growth.  The over-allocation is mild, but is
>>>> 	 * enough to give linear-time amortized behavior over a long
>>>> 	 * sequence of appends() in the presence of a poorly-performing
>>>> 	 * system realloc().
>>>> 	 * The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72,  
>>>> 88, ...
>>>> 	 */
>>>> 	new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6) + newsize;
>>> This is a good one. It is computationally simple and it is not too  
>>> aggressive.
>>>> I'm sure Perl has something interesting too :)  But I think this is
>>>> probably overkill for us ...
>>>>> Based on the outcome of this discussion, would this be one Jira  
>>>>> issue?
>>>>> Reopening LUCENE-969? A separate issue for the contrib changes?
>>>> I would say open a new issue?
>>> OK. I'm on vacation for the next week and I don't know what kind  
>>> of connectivity I'll have. So it might wait a bit.
>>>>> Since this would not be an API change, would there need to be  
>>>>> changes to
>>>>> test cases? If so, what would you suggest?
>>>> I think the existing test cases are likely sufficient though I  
>>>> always
>>>> love adding new ones ;)
>>> Me too ;)
>>> -- DM
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message