Message-Id: <21E33FF4-77B7-4AA9-AE7F-4C22D2737DC9@apache.org>
From: Grant Ingersoll
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
In-Reply-To: <786fde50811130714i2c6fd9a9jafb51cf2c26c0175@mail.gmail.com>
References: <786fde50811121051q21d13d4fmef4cbbb3ab2aec3c@mail.gmail.com> <786fde50811130714i2c6fd9a9jafb51cf2c26c0175@mail.gmail.com>
Subject: Re: Add Token.copyInto(Token) API
Date: Thu, 13 Nov 2008 10:48:10 -0500

I don't want to discourage you from contributing it; I'm just suggesting
you make sure the people working on that patch are aware of the issue,
so that it can perhaps be addressed there.

On Nov 13, 2008, at 10:14 AM, Shai Erera wrote:

> Thanks. I am aware of this thread. Indeed it will change the way
> TokenStreams are handled, and so copying a Token may not be
> necessary. However, I can't tell now whether it will still be
> necessary - I guess I'll just have to wait until it's out and I start
> using it :-)
>
> Anyway, I've implemented it for myself, and thought this might be a
> nice contribution. I can live without it in Lucene :-)
>
> Thanks
> Shai
>
> On Wed, Nov 12, 2008 at 10:02 PM, Grant Ingersoll
> <gsingers@apache.org> wrote:
> Are you aware of LUCENE-1422? There is likely going to be a new way
> of dealing w/ TokenStreams altogether, so you might want to have a
> look there before continuing.
>
>
> On Nov 12, 2008, at 1:51 PM, Shai Erera wrote:
>
> Hi,
>
> I was thinking about adding a copyInto method to Token. The only way
> to clone a token is by using its clone() or clone(char[], int, int,
> int, int) methods. Both do the job, but allocate a Token instance.
> In 2.4 a Token constructor can take a char[] as input (thus saving a
> char[] allocation), but it still allocates an instance.
>
> Even though the instance allocation itself is not that expensive, it
> does allocate additional things, like a String for the type, a
> Payload, and a String for the text (even though that will be removed
> in 3.0).
> If an application wishes to keep one instance of Token around and
> copy other Tokens into it, it can call various methods to achieve
> that, like setTermBuffer, setOffset etc. A copyInto is just a
> convenience method for doing that.
>
> If you wonder about the use case, here it is: I know that it's
> advised to reuse the same Token instance in the TokenStream API
> (basically, make sure to call next(Token)). But there might be
> TokenFilters which need to save a certain occurrence of a token, do
> some processing and return it later. A good example is
> StemmingFilter. One can imagine such a filter returning the original
> token in addition to the stemmed token (for example, for the word
> "tokens" in English, it will return "tokens" [original] and "token"
> [stem]). In that case, the filter has to save the word "tokens" so
> that it returns "tokens" first (or the stem, the order does not
> matter), and the next time its next(Token) is called, it returns the
> stem (or the original) before consuming the next token from the
> TokenStream.
>
> Anyway, I hope it's clear enough, but if not I can elaborate.
> If you think a copyInto() is worth the effort, I can quickly create
> a patch for it.
>
> Shai
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org


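For readers who want to see the idea from the thread in code, here is a minimal sketch. It does not use Lucene's Token class: SimpleToken, OriginalAndStemFilter, their fields, and the toy "drop a trailing s" stemmer are all hypothetical stand-ins, assumed only for illustration. It shows a copyInto that reuses the target's char[] when it is large enough, and a filter that saves one token so it can return the original first and the stem on the following next(...) call.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a reusable token; NOT Lucene's Token class.
final class SimpleToken {
    char[] termBuffer = new char[16];
    int termLength;
    int startOffset;
    int endOffset;
    String type = "word";

    // Overwrites this token in place, growing the buffer only if needed.
    void set(String text, int start, int end) {
        if (text.length() > termBuffer.length) {
            termBuffer = new char[text.length()];
        }
        text.getChars(0, text.length(), termBuffer, 0);
        termLength = text.length();
        startOffset = start;
        endOffset = end;
        type = "word";
    }

    // Copies this token's state into target without allocating a new token;
    // a new char[] is allocated only when target's buffer is too small.
    void copyInto(SimpleToken target) {
        if (target.termBuffer.length < termLength) {
            target.termBuffer = new char[termLength];
        }
        System.arraycopy(termBuffer, 0, target.termBuffer, 0, termLength);
        target.termLength = termLength;
        target.startOffset = startOffset;
        target.endOffset = endOffset;
        target.type = type;
    }

    String term() {
        return new String(termBuffer, 0, termLength);
    }
}

// Hypothetical filter: returns each word, and for words ending in "s"
// also returns a toy stem on the following call.
final class OriginalAndStemFilter {
    private final Iterator<String> input;
    private final SimpleToken saved = new SimpleToken(); // allocated once
    private boolean stemPending;
    private int offset;

    OriginalAndStemFilter(List<String> words) {
        this.input = words.iterator();
    }

    // Fills and returns the caller's reusable token, or null at end of input.
    SimpleToken next(SimpleToken reusable) {
        if (stemPending) {
            stemPending = false;
            saved.copyInto(reusable);   // restore the saved original
            reusable.termLength--;      // toy stemmer: drop the trailing 's'
            reusable.type = "stem";
            return reusable;
        }
        if (!input.hasNext()) {
            return null;
        }
        String word = input.next();
        reusable.set(word, offset, offset + word.length());
        offset += word.length() + 1;
        if (word.length() > 1 && word.endsWith("s")) {
            reusable.copyInto(saved);   // caller may overwrite reusable later
            stemPending = true;
        }
        return reusable;
    }
}
```

With the input words "tokens" and "flow", repeatedly calling next(t) on a single SimpleToken t yields the terms "tokens", "token", "flow" and then null, and neither the filter nor the caller allocates a token per call. The stem keeps the original's offsets, which is a design choice of this sketch rather than something stated in the thread.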