lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hiroaki Kawai <ka...@iij.ad.jp>
Subject Re: Token implementation
Date Tue, 20 May 2008 04:49:25 GMT


"Michael McCandless" <lucene@mikemccandless.com> wrote:
> I agree the situation is not ideal, and it's confusing.
> 
> This comes back to LUCENE-969.
> 
> At the time, we decided to keep both String & char[] only to avoid
> performance cost for those analyzer chains that use String tokens
> exclusively.
> 
> The idea was to allow Token to keep both text or char[] and sometimes
> both (if they are storing the same characters, as happens if
> termBuffer() is called when it's a String being stored)
> 
> Then, in 3.0, we would make the change you are proposing (to only
> store char[] internally).  That was the plan, anyway.  Accelerating
> this plan (to store only char[] today) is compelling, but I worry
> about the performance hit to legacy analyzer chains...

I'd like to suggest another implementation which use 
StringBuilder or CharBuffer instead of char[].
Because we don't need to maintain the length separatly from the 
characater sequence itself. 
If we use char[], then we have to handle char[] and the offset and the
sequence length, the method we implement will be so complex.
I think those should be packed into one object.

I did not test that using StringBuilder or CharBuffer hit the 
performance or not. But I think it might not result in so bad performace.


> More responses below:
> DM Smith <dmsmith555@gmail.com> wrote:
-snip-
> > I was looking at this in light of TokenFilter's next(Token) method and how
> > it was being used. In looking at the contrib filters, they have not been
> > modified. Further, most of them, if they work with the content analysis and
> > generation, do their work in strings. Some of these appear to be good
> > candidates for using char[] rather than strings, such as the NGram filter.
> > But others look like they'd just as well remain with String manipulation.
> 
> It would be great to upgrade all contrib filters to use the re-use APIs.

I'll contribute, too. :-)


> > I'd like to suggest that internally, that Token be changed to only use
> > char[] termBuffer and eliminate termText.
> 
> The question is what performance cost we are incurring eg on the
> contrib (& other) sources/filters?  Every time setTermText is called,
> we copy out the chars (instead of holding a reference to the String).
> Every time getText() is called we create a new String(...) from the
> char[].  I think it's potentially a high cost, and so maybe we should
> wait until 3.0 when we drop the deprecated APIs?
> 
> > And also, that termText be restored as not deprecated.
> 
> It made me nervous keeping this method because it looks like it should
> be cheap to call, and in the past it was very cheap to call.  But,
> maybe we could keep it, if we mark very very clearly in the javadocs
> the performance cost you incur by using this method (it makes a new
> String() every time)?

I'd like to suggest changing the method definition to:
 public void setTermText(CharSequence text)


> > But, in TokenFilter, next() should be deprecated, IMHO.
> 
> I think this is a good idea.  After all if people don't want to bother
> using the passed in Token, they are still allowed to return a new
> one.

I could not see what you meant. Can I ask you to let me know the reason
why it should be deprecated?




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message