Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Reply-To: java-dev@lucene.apache.org
From: DM Smith
To: DM Smith
Cc: java-dev@lucene.apache.org
Subject: Re: Token implementation
Date: Fri, 11 Jul 2008 08:32:28 -0400

I am now looking at this in depth. Here is what I am finding:

Token and Term are tightly paired, but their implementations are not. Term holds a field and a word, both of which are Strings. So to get the term out of a Token and put it into a Term, one of two constructs is used:

    t = new Term(field, token.termText());

or:

    String text = new String(token.termBuffer(), 0, token.termLength());
    t = new Term(field, text);

Both of these are awkward. Shouldn't Term have constructors that take a Token? Because of reuse, these constructors would have to copy the term text into the Term's own field.
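For illustration, here is a rough sketch of what the suggested constructor might look like. Term has no such constructor today, and the minimal Token and Term stand-ins below are simplified for the sketch, not Lucene's actual classes:

```java
// Hypothetical sketch of a Term constructor that accepts a Token directly.
// Minimal stand-in classes for illustration, not Lucene's actual sources.
class Token {
    private final char[] termBuffer;
    private final int termLength;

    Token(String text) {
        termBuffer = text.toCharArray();
        termLength = termBuffer.length;
    }

    char[] termBuffer() { return termBuffer; }
    int termLength() { return termLength; }
}

class Term {
    private final String field;
    private final String text;

    // The proposed convenience constructor: because Tokens are reused,
    // it must copy the chars out into the Term's own immutable String.
    Term(String field, Token token) {
        this.field = field;
        this.text = new String(token.termBuffer(), 0, token.termLength());
    }

    String field() { return field; }
    String text() { return text; }
}

public class TermFromToken {
    public static void main(String[] args) {
        Term t = new Term("contents", new Token("lucene"));
        System.out.println(t.field() + ":" + t.text());   // contents:lucene
    }
}
```

The copy (rather than holding a reference to the buffer) matters because a reused Token's buffer will be overwritten by the next token in the stream.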
-- DM

On May 20, 2008, at 5:21 AM, DM Smith wrote:

>
> On May 20, 2008, at 5:01 AM, Michael McCandless wrote:
>
>>
>> DM Smith wrote:
>>>
>>> On May 19, 2008, at 4:33 PM, Michael McCandless wrote:
>>>
>>>> DM Smith wrote:
>>>>> Michael McCandless wrote:
>>>>>>
>>>>>> I agree the situation is not ideal, and it's confusing.
>>>>>
>>>>> My problem as a user is that I have to read the code to figure out how to optimally use the class. The JavaDoc is a bit wanting.
>>>>
>>>> Yeah, we should fix the javadocs.
>>>>
>>>>>> This comes back to LUCENE-969.
>>>>>>
>>>>>> At the time, we decided to keep both String & char[] only to avoid performance cost for those analyzer chains that use String tokens exclusively.
>>>>>>
>>>>>> The idea was to allow Token to keep both text or char[] and sometimes both (if they are storing the same characters, as happens if termBuffer() is called when it's a String being stored)
>>>>>
>>>>> When termBuffer() is called, termText is always null on return. This is the invariant of initTermBuffer(), which is called directly or indirectly by every method that touches termBuffer.
>>>>
>>>> Sorry, I meant termText().
>>>>
>>>>> It is only after calling termText() that one could have both. The only advantage I see here is that calling it twice without any intervening call to a termBuffer method would return the same physical String.
>>>>
>>>> Right.
>>>>
>>>>> After calling setTermText(newText), termBuffer is set to null.
>>>>>
>>>>> I presume the purpose of a filter is to get the token content and, if modified, set the new content. If so, the result of the setXXX will be that either termText or termBuffer will be null.
>>>>
>>>> Right, though if there's no change and the next filter in the chain calls termText(), we save constructing a new String by caching it.
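The invariant being discussed can be sketched roughly as follows. This is a simplified model for illustration only, not Lucene's actual Token source: termText() caches the String it builds, setTermText() drops the buffer, and termBuffer() drops the String.

```java
// Simplified model (not Lucene's actual Token) of the String/char[]
// invariant discussed in the thread.
public class TokenSketch {
    private String termText;
    private char[] termBuffer;
    private int termLength;

    public void setTermText(String text) {
        termText = text;
        termBuffer = null;          // String becomes the sole representation
    }

    public String termText() {
        if (termText == null) {     // cache; the buffer stays valid alongside
            termText = new String(termBuffer, 0, termLength);
        }
        return termText;
    }

    public char[] termBuffer() {
        if (termBuffer == null) {   // copy the String into a fresh buffer...
            termLength = termText.length();
            termBuffer = new char[termLength];
            termText.getChars(0, termLength, termBuffer, 0);
            termText = null;        // ...and, per the thread, drop the String
        }
        return termBuffer;
    }

    public static void main(String[] args) {
        TokenSketch t = new TokenSketch();
        t.setTermText("lucene");
        char[] buf = t.termBuffer();                      // converts, nulls termText
        System.out.println(new String(buf, 0, 6));        // lucene
        System.out.println(t.termText() == t.termText()); // true: String is cached
    }
}
```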
>>>>
>>>>>> Then, in 3.0, we would make the change you are proposing (to only store char[] internally). That was the plan, anyway. Accelerating this plan (to store only char[] today) is compelling, but I worry about the performance hit to legacy analyzer chains...
>>>>>
>>>>> I wonder whether it is all that significant an issue. Today, some of the Lucene analyzers have been migrated to using char[], while others, notably contrib, continue to use text.
>>>>>
>>>>> IMHO: Prior to char[], the text was replaceable, but not modifiable. From a practical perspective, Token reuse minimized the cost of construction, but not much else. The performance of a Token was predictable, but the filter was potentially sub-optimal. With char[] and supporting methods, the text became modifiable, too.
>>>>>
>>>>> When a filter calls setTermText or setTermBuffer, it does not know how the consumer of the token will work. It could be that it stores it with setTermText and the next filter calls termBuffer().
>>>>>
>>>>> I may not understand this correctly, but it seems to me that the following is plausible given a filter chain of Lucene-provided filters (including contrib). If we have a token filter chain of A -> B -> C, which uses next() in any part of the chain, the flow of a reusable Token is stopped. A given filter may cache a Token and reuse it. So consider the following scenario:
>>>>> A overrides next(Token) and reuses the token via char[].
>>>>> B overrides next() and has a cached Token and updates text.
>>>>> C overrides next(Token) and reuses the token via char[].
>>>>>
>>>>> First run:
>>>>> After A is done, the termText in the supplied Token is null and termBuffer has a value.
>>>>>
>>>>> B provides its own token, so it is not influenced by A.
>>>>>
>>>>> C is given the token that B returns, because the caller is effectively using "token = input.next(token)", but because the token has text, a performance hit is taken to put it into char[]. Both text and char[] start out the same, but because char[] is changed, termText is set to null.
>>>>>
>>>>> Second run:
>>>>> A starts with a Token with a char[] because it is reusing the token from the last run or because it is using a localToken from the caller. If it is a localToken, then the scenario is as above. But if it is the end result of the first run, then A is re-using the token that is cached in B. Since C last modified it, it is char[].
>>>>>
>>>>> B uses its cached Token, but it was modified by A to be char[] with null text. Now B takes a performance hit as it creates a new String.
>>>>>
>>>>> C is as in the first run.
>>>>>
>>>>> Another scenario:
>>>>> A, B and C are all legacy. This would only be filter chains that are not provided by core Lucene, as the core filter chains have been migrated. This would be a performance hit.
>>>>
>>>> Why is this a performance hit? If they are all legacy, they would all have implemented the non-reuse next() API and would use setTermText, and no conversion to char[] would happen (except inside DocumentsWriter)?
>>>
>>> I was thinking faster than I was typing. I meant to say "not be a performance hit".
>>
>> Ahh, ok.
>>
>>>>
>>>>
>>>>> A last scenario:
>>>>> A, B and C are all char[]. This would not take a performance hit.
>>>>>
>>>>> It seems to me that in a mixed chain, there will always be a performance hit.
>>>>
>>>> Right, though since we cache the String we should save on multiple calls to termText(). And in a non-mixed chain (all new or all old) I think there wouldn't be a hit.
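The way a single non-reuse filter breaks Token sharing can be sketched with simplified stand-ins. These are not Lucene's actual TokenStream/TokenFilter classes; the names and shapes below are illustrative only:

```java
// Simplified stand-ins (not Lucene's actual classes) for the mixed-chain
// scenario: a reuse-style filter fills in the caller's Token, while a
// legacy-style filter returns its own cached Token, breaking sharing.
class Token {
    char[] termBuffer = new char[16];
    int termLength;

    void setTermBuffer(char[] buf, int offset, int length) {
        if (termBuffer.length < length) termBuffer = new char[length];
        System.arraycopy(buf, offset, termBuffer, 0, length);
        termLength = length;
    }

    String term() { return new String(termBuffer, 0, termLength); }
}

// New-style filter: reuses the caller's Token via its char[] buffer.
class ReuseFilter {
    Token next(Token reusable) {
        char[] chars = "reused".toCharArray();
        reusable.setTermBuffer(chars, 0, chars.length);
        return reusable;      // same instance: buffer sharing preserved
    }
}

// Legacy-style filter: ignores the caller's Token and returns its own.
class LegacyFilter {
    private final Token cached = new Token();

    Token next(Token ignored) {
        char[] chars = "cached".toCharArray();
        cached.setTermBuffer(chars, 0, chars.length);
        return cached;        // different instance: sharing broken
    }
}

public class MixedChainDemo {
    public static void main(String[] args) {
        Token local = new Token();
        System.out.println(new ReuseFilter().next(local) == local);   // true
        System.out.println(new LegacyFilter().next(local) == local);  // false
    }
}
```

Once a LegacyFilter-style link appears in the chain, every downstream filter sees a different Token instance than the one it passed in, so the caller's buffer is never grown to a steady-state size.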
>>>>
>>>>> But my guess is that maintaining termBuffer once used would be a good thing.
>>>>
>>>> Interesting...
>>>>
>>>>> So, a modified suggestion to maintain the performance but improve the first scenario. Do you see any problem with the following?
>>>>>
>>>>> If termBuffer is used in a token, then it is maintained and never set to null.
>>>>> Note also that resizeTermBuffer(size) maintains the termBuffer. That is, it copies the text when the array is grown. When it is known that it is going to be slammed, this is unnecessary. One can implement a helper function that merely grows the array. There are a couple of places where this applies.
>>>>>
>>>>> Thus setTermText would be something like:
>>>>> public void setTermText(String text) {
>>>>>   termText = text;
>>>>>   if (termBuffer != null) {
>>>>>     growTermBuffer(termText.length());
>>>>>     termText.getChars(0, termText.length(), termBuffer, 0);
>>>>>   }
>>>>> }
>>>>>
>>>>> A possible implementation to grow the array would be:
>>>>> private void growTermBuffer(int newSize) {
>>>>>   // determine the best size
>>>>>   if (newSize < MIN_BUFFER_SIZE) {
>>>>>     newSize = MIN_BUFFER_SIZE;
>>>>>   }
>>>>>
>>>>>   // if the buffer exists and is too small, then determine a better size.
>>>>>   // this is the current doubling algorithm. it could be better.
>>>>>   int tbLength = termBuffer == null ? 0 : termBuffer.length;
>>>>>   if (tbLength > 0 && newSize > tbLength) {
>>>>>     int size = tbLength;
>>>>>     while (size < newSize) {
>>>>>       size *= 2;
>>>>>     }
>>>>>     newSize = size;
>>>>>   }
>>>>>
>>>>>   // check to see if the buffer needs to be resized
>>>>>   if (newSize > tbLength) {
>>>>>     termBuffer = new char[newSize];
>>>>>   }
>>>>> }
>>>>
>>>> That looks good. Though, if Token is 100% re-used (full chain is "new") then I would expect growing the char[] to be very low cost (happens very few times).
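The two growth policies discussed in this thread can be compared side by side in a standalone sketch: the doubling algorithm from the proposal above, and the milder CPython-style over-allocation brought up later in the thread. MIN_BUFFER_SIZE = 10 is an assumption here, and neither method is Lucene's actual implementation:

```java
// Standalone comparison of the two buffer-growth policies discussed in
// this thread. MIN_BUFFER_SIZE = 10 is an assumption; neither method is
// Lucene's actual code.
public class GrowthPolicies {
    static final int MIN_BUFFER_SIZE = 10;

    // Doubling, as in the growTermBuffer() proposal above.
    static int doubledSize(int currentLength, int needed) {
        if (needed < MIN_BUFFER_SIZE) needed = MIN_BUFFER_SIZE;
        if (currentLength > 0 && needed > currentLength) {
            int size = currentLength;
            while (size < needed) size *= 2;
            needed = size;
        }
        return needed;
    }

    // CPython's milder list over-allocation (listobject.c):
    // growth pattern 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
    static int pythonSize(int needed) {
        return (needed >> 3) + (needed < 9 ? 3 : 6) + needed;
    }

    public static void main(String[] args) {
        // Grow a 10-char buffer past 100 chars, one char at a time,
        // under each policy, and see where each lands.
        int doubling = MIN_BUFFER_SIZE;
        while (doubling < 100) doubling = doubledSize(doubling, doubling + 1);
        System.out.println("doubling reaches " + doubling);     // 160

        int python = 0;
        while (python < 100) python = pythonSize(python + 1);
        System.out.println("python-style reaches " + python);   // 106
    }
}
```

Doubling over-allocates more per step but reallocates fewer times; the CPython formula over-shoots less at the cost of more (still amortized-linear) reallocations, which matches the trade-off described below.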
>>>
>>> Without peeking at your Python sample below ;) the doubling algorithm is costly, as it over-allocates too much. But it has the nice behavior that reallocations are fewer.
>>>
>>> But, now peeking ahead, a less aggressive allocation policy would cause more allocations, and the array copy becomes more important. Though not significantly. But in a mixed environment with some sharing, it could be magnified.
>>>
>>> I didn't dig into it, but in the core, are reusable Tokens per Document or per DocumentWriter? If it is the former, then the copy becomes more significant. If it is the latter, then it is almost moot, as you point out.
>>
>> It's actually per DocumentsWriter per thread per field.
>
> This is excellent.
>
>>
>>
>>>>
>>>>
>>>>> More below....
>>>>>>
>>>>>> More responses below:
>>>>>>
>>>>>> DM Smith wrote:
>>>>>>
>>>>>>>
>>>>>>> I think the Token implementation as it stands can use some improvement, and I'd be willing to do it. I'd like some input, though. Especially because it is core to Lucene.
>>>>>>>
>>>>>>> I've been working on eliminating deprecations from my user code and I ran across Token.getText() as being deprecated.
>>>>>>>
>>>>>>> This is not about my code, but the code in Token.
>>>>>>>
>>>>>>> Token allows one of two representations of the actual token: either an immutable String or a mutable char[]. One can flip back and forth between these all too easily.
>>>>>>>
>>>>>>> termText() is deprecated, so termBuffer() is suggested as a replacement.
>>>>>>>
>>>>>>> Calling termBuffer() will potentially copy the text out of the String termText and into the newly created buffer and return it.
>>>>>>>
>>>>>>> Calling setTermText(str), which is not deprecated, will drop the buffer and save the str in termText.
>>>>>>>
>>>>>>> It appears that the invariant that is trying to be established is: either termText or termBuffer holds the content, but not both. However, termBuffer() potentially violates this by loading termText with the termBuffer, but not nulling out the buffer.
>>>>>>
>>>>>> Actually, both are allowed to be set, as long as they are the same. So termBuffer() is allowed to leave both non-null.
>>>>>
>>>>> I'm not sure of the advantage.
>>>>
>>>> Covered above (mixed chains).
>>>>
>>>>>>> Also, in my code, I am not manipulating char[], so after getting the buffer I need to immediately convert it to a String to process it. And then when I'm done, I have a String possibly of some other length. To stuff it back into the termBuffer requires a call to:
>>>>>>> setTermBuffer(s.toCharArray(), 0, s.length())
>>>>>>
>>>>>> It would be better to call Token.resizeTermBuffer(...), then s.getChars into the Token's term buffer (saves a buffer copy).
>>>>>
>>>>> Many thanks. I saw the comment, but missed getChars. This is better, but even better would be growTermBuffer (above), since resizeTermBuffer potentially has an unnecessary copy.
>>>>
>>>> True, but remember growing should be very rare in the 100% reuse (new) case.
>>>
>>> I'm presuming then that the DocumentWriter is the holder of the one reusable Token.
>>>
>>>>
>>>>
>>>>>>> I was looking at this in light of TokenFilter's next(Token) method and how it was being used. In looking at the contrib filters, they have not been modified. Further, most of them, if they work with the content analysis and generation, do their work in Strings. Some of these appear to be good candidates for using char[] rather than Strings, such as the NGram filter.
>>>>>>> But others look like they'd just as well remain with String manipulation.
>>>>>>
>>>>>> It would be great to upgrade all contrib filters to use the re-use APIs.
>>>>>
>>>>> I'll be happy to work toward that end. I think it affects my performance today.
>>>>
>>>> That would be great :)
>>>>
>>>>>>> I'd like to suggest that, internally, Token be changed to only use the char[] termBuffer and eliminate termText.
>>>>>>
>>>>>> The question is what performance cost we are incurring, e.g. on the contrib (& other) sources/filters. Every time setTermText is called, we copy out the chars (instead of holding a reference to the String). Every time getText() is called, we create a new String(...) from the char[]. I think it's potentially a high cost, and so maybe we should wait until 3.0 when we drop the deprecated APIs?
>>>>>
>>>>> See above. I'll concede that. But I think that once termBuffer is used because of a mixed chain, it should be retained.
>>>>
>>>> This may cause problems if a core Tokenizer is used to produce the tokens, but then a big chain of old (non-reuse) filters is used after that?
>>>
>>> If the reusable token is per DocumentWriter and the chain is per Document, then it is an advantage to retain it. The buffer is there for the next Document, likely big enough that it does not need to be allocated again.
>>
>> Well, remember that a single non-reuse filter in the chain "breaks" the sharing, going backwards. I.e. the new token source will see a different Token with every next() call. So the buffer doesn't get reused. So I think in this case we would spend a lot of extra time converting the Strings in the non-reuse filters back to char[].
>
> IIRC, I saw somewhere, in an outer driving loop, something like the following:
>
> result = input.next(localToken)
>
> If my memory is correct, then reuse is re-established each time through the loop. That localToken is presumably using char[] to set initial content. Or at least once all core and contrib are converted to work with it.
>
>
>>
>>
>>>>
>>>>
>>>>>>> And also, that termText be restored as not deprecated.
>>>>>>
>>>>>> It made me nervous keeping this method because it looks like it should be cheap to call, and in the past it was very cheap to call. But maybe we could keep it, if we mark very, very clearly in the javadocs the performance cost you incur by using this method (it makes a new String() every time)?
>>>>>>
>>>>>>> But, in TokenFilter, next() should be deprecated, IMHO.
>>>>>>
>>>>>> I think this is a good idea. After all, if people don't want to bother using the passed-in Token, they are still allowed to return a new one.
>>>>>
>>>>> Should the constructors in Token that take a String be deprecated, too? The comments seem to suggest that.
>>>>
>>>> Actually, I think they should. We should strongly encourage the re-use APIs.
>>>>
>>>>>>> I have also used a better algorithm than doubling for resizing an array. I'd have to hunt for it.
>>>>>>
>>>>>> That would be great!
>>>>>
>>>>> I'm looking, but still haven't found it. :)
>>>>>
>>>>> I'm looking into digging into this one.
>>>>
>>>> Python has an interesting approach for its list type. Here's a snippet from listobject.c in Python's sources:
>>>>
>>>> /* This over-allocates proportional to the list size, making room
>>>>  * for additional growth. The over-allocation is mild, but is
>>>>  * enough to give linear-time amortized behavior over a long
>>>>  * sequence of appends() in the presence of a poorly-performing
>>>>  * system realloc().
>>>>  * The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
>>>>  */
>>>> new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6) + newsize;
>>>
>>> This is a good one. It is computationally simple and it is not too aggressive.
>>>
>>>>
>>>>
>>>> I'm sure Perl has something interesting too :) But I think this is probably overkill for us ...
>>>>
>>>>> Based on the outcome of this discussion, would this be one Jira issue? Reopening LUCENE-969? A separate issue for the contrib changes?
>>>>
>>>> I would say open a new issue?
>>>
>>> OK. I'm on vacation for the next week and I don't know what kind of connectivity I'll have. So it might wait a bit.
>>>
>>>>
>>>>
>>>>> Since this would not be an API change, would there need to be changes to test cases? If so, what would you suggest?
>>>>
>>>> I think the existing test cases are likely sufficient, though I always love adding new ones ;)
>>>
>>> Me too ;)
>>>
>>> -- DM

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org