Message-ID: <47441F22.7010902@Stolsvik.com>
Date: Wed, 21 Nov 2007 13:05:54 +0100
From: Endre Stølsvik
To: java-dev@lucene.apache.org
Subject: Re: new Token API
References: <47422419.3020906@apache.org>

Yonik Seeley wrote:
> On Nov 19, 2007 7:02 PM, Doug Cutting wrote:
>> Yonik Seeley wrote:
>>> 1) If we are deprecating some methods like String termText(), how
>>> about at the same time deprecating "String type"? If we want
>>> lightweight per-token metadata for communication between filters, an
>>> int or a long used as a bitvector (32 or 64 independent boolean vars
>>> per token) would be much more useful than a single String.
>> There are tokenizers that use the type string, e.g., StandardFilter &
>> similar things in Nutch. How would you replace such uses? Add a bit
>> for each token type? Is that really that much more useful?
>
> It is, given that it enables a token to have more than one type at once.
> The benefit is probably relatively minor (the number of people who
> would use it), and I wouldn't have brought it up except that it could
> piggy-back on the other recent changes to Token.

I'm just a lurker! However, I'll chime in and say that this sounds
interesting. But please use a long if you do such a thing: it is better
to have some extra bits available for the future, and given that most
future Lucene installations will run on 64-bit systems, it shouldn't
have a performance impact.

You could, however, use a String[] or a Set to communicate (or
potentially put comma-separated values in the one String, but that makes
uniquely identifying your particular token somewhat messy).

Will the restriction of 32 core bits and 32 user bits ever be a problem?
What about completely different usages, like categorizing something into
an indefinite number of bins? (Just to be the devil's advocate..)

A Michael mentioned setting some reference to null, with the result
being that GC kicked in more often. If this is the case for that
particular scenario, then please don't optimize along those lines.
Getting rid of your never-to-be-used-again objects as fast as possible
is _always_ good. If in some strange situation the opposite seems true,
that will probably change radically in the next iteration of GC
development, or, for example, once the huge bunch of GC selection and
tuning parameters is set correctly .. or something..

With that said, reusing the char[] is obviously the better way to go:
not creating an object at all is of course better than dropping an
object and then recreating the same thing moments afterwards.

Have you run your profilers on this question? That seems like a prudent
thing to do if you're in a situation where some API will change anyway.

Thanks for reading my ramblings,
Endre.
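
For concreteness, the long-as-bitvector idea discussed in this thread could look roughly like the sketch below. It is only an illustration: `TokenFlags`, its constants, and the split into 32 low "core" bits and 32 high "user" bits are assumptions made up for this example, not part of Lucene's Token API.

```java
// Hypothetical sketch: none of these names exist in Lucene.
final class TokenFlags {
    // Low 32 bits: assumed to be reserved for "core" token types.
    static final long ALPHANUM = 1L << 0;
    static final long ACRONYM  = 1L << 1;
    static final long EMAIL    = 1L << 2;

    // High 32 bits: assumed to be free for application-defined flags.
    static final int USER_SHIFT = 32;

    private TokenFlags() {}

    /** Returns the mask for user-defined flag number n (0..31). */
    static long userFlag(int n) {
        if (n < 0 || n > 31) {
            throw new IllegalArgumentException("user flag index must be 0..31, got " + n);
        }
        return 1L << (USER_SHIFT + n);
    }

    /** Returns flags with every bit of mask switched on. */
    static long set(long flags, long mask) {
        return flags | mask;
    }

    /** Returns flags with every bit of mask switched off. */
    static long clear(long flags, long mask) {
        return flags & ~mask;
    }

    /** True only if every bit of mask is on in flags. */
    static boolean has(long flags, long mask) {
        return (flags & mask) == mask;
    }

    public static void main(String[] args) {
        long flags = 0L;
        flags = set(flags, ACRONYM);      // a core type
        flags = set(flags, userFlag(0));  // plus an application-defined flag
        System.out.println(has(flags, ACRONYM));   // true
        System.out.println(has(flags, ALPHANUM));  // false
        System.out.println(Long.bitCount(flags));  // 2
    }
}
```

Because each type is its own bit, a token can carry several types at once, which a single String type cannot express. The price is the hard ceiling of 64 flags questioned above, so an open-ended set of categories would still need something like a Set.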