Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Reply-To: java-dev@lucene.apache.org
Message-ID: <1354751673.1218654284241.JavaMail.jira@brutus>
Date: Wed, 13 Aug 2008 12:04:44 -0700 (PDT)
From: "DM Smith (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Issue Comment Edited: (LUCENE-1333) Token implementation needs improvements
In-Reply-To: <2108817201.1215796831728.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622238#action_12622238 ]

dmsmith edited comment on LUCENE-1333 at 8/13/08 12:04 PM:
------------------------------------------------------------

Regarding the implementation of hashCode:
You are using the following:
{code}
private static int hashCode(int i) {
  return new Integer(i).hashCode();
}
{code}
This is rather expensive. Integer.hashCode() merely returns its value. Constructing a new Integer is unnecessary.

While adding Token's integer values in Token's hashCode is perfectly fine, it is not quite optimal. And may cause unnecessary collisions. It might be better to pretend that Token's integer values are also in an array (using the ArrayUtil algorithm, this could be):
{code}
public int hashCode() {
  initTermBuffer();
  int code = termLength;
  code = code * 31 + startOffset;
  code = code * 31 + endOffset;
  code = code * 31 + flags;
  code = code * 31 + positionIncrement;
  code = code * 31 + type.hashCode();
  code = (payload == null ? code : code * 31 + payload.hashCode());
  code = code * 31 + ArrayUtil.hashCode(termBuffer, 0, termLength);
  return code;
}
{code}

Also, are the reinit methods used? If not, I'd like to work up a patch that uses them. (And I'll include the above in it.) (never mind. I see that they are! super! But I'm working up a patch for this and a couple of minor optimizations that affect Token)

I'll probably add copyFrom(Token) as a means to initialize one token to have the same content as another. There are a couple of places that this is appropriate.

was (Author: dmsmith):

Regarding the implementation of hashCode:
You are using the following:
{code}
private static int hashCode(int i) {
  return new Integer(i).hashCode();
}
{code}
This is rather expensive. Integer.hashCode() merely returns its value. Constructing a new Integer is unnecessary.

While adding Token's integer values in Token's hashCode is perfectly fine, it is not quite optimal. And may cause unnecessary collisions.
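Both points above can be checked with a few lines of Java. This is only a minimal sketch: the class TokenHashSketch and its two helper methods are hypothetical stand-ins for illustration, not part of Lucene or of the patch, and they combine just two of Token's int fields.

```java
public class TokenHashSketch {
    // Hypothetical helper: combines two int fields by plain addition.
    static int additiveHash(int startOffset, int endOffset) {
        return startOffset + endOffset;
    }

    // Hypothetical helper: combines the same fields with the multiply-by-31 scheme.
    static int multiplicativeHash(int startOffset, int endOffset) {
        int code = startOffset;
        code = code * 31 + endOffset;
        return code;
    }

    public static void main(String[] args) {
        // Integer.hashCode() returns the wrapped int, so boxing adds nothing.
        System.out.println(Integer.valueOf(42).hashCode() == 42); // prints true

        // Offsets (0,5) and (1,4) have the same sum, so addition collides.
        System.out.println(additiveHash(0, 5) == additiveHash(1, 4)); // prints true

        // The multiplicative scheme keeps them apart: 5 versus 35.
        System.out.println(multiplicativeHash(0, 5) == multiplicativeHash(1, 4)); // prints false
    }
}
```

Addition is order-insensitive, so any two tokens whose int fields sum to the same total hash alike; multiplying by 31 before each addition makes the position of each field matter.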
It might be better to pretend that Token's integer values are also in an array (using the ArrayUtil algorithm, this could be):

  public int hashCode() {
    initTermBuffer();
    int code = termLength;
    code = code * 31 + startOffset;
    code = code * 31 + endOffset;
    code = code * 31 + flags;
    code = code * 31 + positionIncrement;
    code = code * 31 + type.hashCode();
    code = (payload == null ? code : code * 31 + payload.hashCode());
    code = code * 31 + ArrayUtil.hashCode(termBuffer, 0, termLength);
    return code;
  }

Also, are the reinit methods used? If not, I'd like to work up a patch that uses them. (And I'll include the above in it.)

I'll probably add copyFrom(Token) as a means to initialize one token to have the same content as another. There are a couple of places that this is appropriate.

> Token implementation needs improvements
> ---------------------------------------
>
>                 Key: LUCENE-1333
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1333
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.1
>         Environment: All
>            Reporter: DM Smith
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/%3C21F67CC2-EBB4-48A0-894E-FBA4AECC0D50@gmail.com%3E
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to use the class.
> 2. Deprecations are incomplete. The constructors that take String as an argument and the methods that take and/or return String should *all* be deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting buffer can be over-allocated. A less aggressive algorithm would be better. In the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be upgraded now rather than waiting for 3.0. As it stands, filter chains that alternate between char[] and String are sub-optimal. Currently, the deprecated methods are used in core by Query classes. The rest are in contrib, mostly in analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token). next() should be deprecated so that reuse is maximized, and descendant classes should be rewritten to override next(Token).
> 7. Tokens are often stored as a String in a Term. It would be good to add constructors that take a Token. This would simplify the use of the two together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org