Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <c68e39170707201152n70be7c69gb84ded70509353c0@mail.gmail.com>
Date: Fri, 20 Jul 2007 14:52:42 -0400
From: "Yonik Seeley" <yonik@apache.org>
Sender: yseeley@gmail.com
To: java-dev@lucene.apache.org
Subject: Re: Token termBuffer issues
In-Reply-To: <1184891270.16597.1201090809@webmail.messagingengine.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <c68e39170707191523h54f6a10ft59780102ea4c9ba7@mail.gmail.com>
	 <1184891270.16597.1201090809@webmail.messagingengine.com>

On 7/19/07, Michael McCandless <lucene@mikemccandless.com> wrote:
> "Yonik Seeley" <yonik@apache.org> wrote:
> > I had previously missed the changes to Token that add support for
> > using an array (termBuffer):
> >
> > +  // For better indexing speed, use termBuffer (and
> > +  // termBufferOffset/termBufferLength) instead of termText
> > +  // to save new'ing a String per token
> > +  char[] termBuffer;
> > +  int termBufferOffset;
> > +  int termBufferLength;
> >
> > While I think this approach would have been best to start off with
> > rather than String,
> > I'm concerned that it will do little more than add overhead at this
> > point, resulting in slower code, not faster.
> >
> > - If any tokenizer or token filter tries setting the termBuffer, any
> > downstream components would need to check for both.  It could be made
> > backward compatible by constructing a string on demand, but that will
> > really slow things down, unless the whole chain is converted to only
> > using the char[] somehow.
>
> Good point: if your analyzer/tokenizer produces char[] tokens then
> your downstream filters would have to accept char[] tokens.
>
> I think on-demand constructing a String (and saving it as termText)
> would be an OK solution?  Why would that be slower than having to make
> a String in the first place (if we didn't have the char[] API)?  It's
> at least graceful degradation.

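But String-based filters are the rule rather than the exception: pretty
much everything is based on String, so the on-demand String would end up
being constructed in nearly every analysis chain.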
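
To make the cost concrete, here is a minimal sketch of on-demand String construction. It is not the actual Lucene Token class: LazyTermToken, termText() as an accessor method, and the stringsCreated counter are hypothetical names used only for illustration.

```java
// Hypothetical sketch; LazyTermToken is not the real Lucene Token class.
final class LazyTermToken {
    private char[] termBuffer;
    private int termBufferOffset;
    private int termBufferLength;
    private String termText;   // cached String view of the buffer, built on demand
    int stringsCreated;        // illustration only: counts on-demand allocations

    void setTermBuffer(char[] buffer, int offset, int length) {
        termBuffer = buffer;
        termBufferOffset = offset;
        termBufferLength = length;
        termText = null;       // the cached String no longer matches the buffer
    }

    // Legacy accessor: any String-based filter that calls this forces an allocation.
    String termText() {
        if (termText == null && termBuffer != null) {
            termText = new String(termBuffer, termBufferOffset, termBufferLength);
            stringsCreated++;
        }
        return termText;
    }
}
```

A chain that never calls termText() creates no Strings, but a single String-based filter anywhere downstream brings back one allocation per token, on top of the extra null check.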

> > - It doesn't look like the indexing code currently pays any attention
> > to the char[], right?
>
> It does, in DocumentsWriter.addPosition().

Ah, thanks.

> > - What if both the String and char[] are set?  A filter that doesn't
> > know better sets the String... this doesn't clear the char[]
> > currently, should it?
>
> Currently the char[] wins, but good point: seems like each setter
> should null out the other one?

Certainly the String setter should null the char[] (that's the only
way to keep backward compatibility), and probably vice versa.
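
A minimal sketch of setters that clear each other, assuming a simplified token; DualTermToken and currentTerm() are hypothetical names, not the real Token API:

```java
// Hypothetical sketch; DualTermToken is not the real Lucene Token class.
final class DualTermToken {
    private String termText;
    private char[] termBuffer;
    private int termBufferOffset;
    private int termBufferLength;

    // String setter: drop the buffer so a stale char[] cannot "win".
    void setTermText(String text) {
        termText = text;
        termBuffer = null;
        termBufferOffset = 0;
        termBufferLength = 0;
    }

    // char[] setter: drop the String so the two views cannot disagree.
    void setTermBuffer(char[] buffer, int offset, int length) {
        termBuffer = buffer;
        termBufferOffset = offset;
        termBufferLength = length;
        termText = null;
    }

    // What a consumer (e.g. the indexing code) would see: buffer first, else String.
    String currentTerm() {
        if (termBuffer != null) {
            return new String(termBuffer, termBufferOffset, termBufferLength);
        }
        return termText;
    }
}
```

With this, a tokenizer can set the buffer to "Foo", a String-based lowercase filter can then call setTermText("foo"), and the consumer sees "foo" rather than the stale buffer contents.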

Note that many existing filters directly access and manipulate the
package-private String termText field.  These will need to be changed.

-Yonik
