lucene-dev mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: Token termBuffer issues
Date Tue, 24 Jul 2007 20:15:04 GMT
On 7/24/07, Michael McCandless <lucene@mikemccandless.com> wrote:
> OK, I ran some benchmarks here.
>
> The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> 17.2% speedup using Sun's JDK 6, on Linux.  This is indexing all
> Wikipedia content using LowerCaseTokenizer + StopFilter +
> PorterStemFilter.  I think it's worth pursuing!

Did you try it without token reuse (i.e., reusing tokens only when
mutating, not when creating new tokens from the tokenizer)?  It would
be interesting to see how much of the gain is attributable to Token
reuse alone, after the core filters have been optimized to use the
char[] setters, etc.
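
To make the distinction concrete, here's a rough sketch of the two
levels of reuse (the TokenSketch stand-in and the method names are
illustrative, not the actual patch):

  // Minimal stand-in for Token, just enough to show the two reuse levels.
  class TokenSketch {
    char[] buf = new char[32];
    int len;
    int positionIncrement = 1;

    void setTermBuffer(char[] b, int off, int n) {
      if (buf.length < n) buf = new char[n];  // grow, never shrink
      System.arraycopy(b, off, buf, 0, n);
      len = n;
    }
  }

  class TokenizerSketch {
    char[] scratch = new char[256];  // filled from the Reader (elided)
    int scratchLen;

    // (a) "reuse only when mutating": the tokenizer still allocates a
    // fresh token per term; only downstream filters avoid allocation by
    // mutating the tokens they receive via the setters.
    TokenSketch next() {
      TokenSketch t = new TokenSketch();  // one allocation per term
      t.setTermBuffer(scratch, 0, scratchLen);
      return t;
    }

    // (b) full reuse: the caller passes the previous token back in and
    // the tokenizer just refills its buffer; no per-term allocation.
    TokenSketch next(TokenSketch result) {
      result.setTermBuffer(scratch, 0, scratchLen);
      return result;
    }
  }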

We've had issues in the past with filters mishandling token properties:
 1) filters creating a new token from an old token, but forgetting to
set positionIncrement
 2) legacy filters losing "new" information such as payloads when
creating a new token, because payloads didn't exist when the filter
was written.

#1 is solved by token mutation, because there are now setters for the
term text (before, the filter author was forced to create a new token
unless they could access the package-private String).
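
For the record, the failure mode looked roughly like this (just a
sketch; the Token(String, int, int) constructor leaves
positionIncrement at its default of 1, and setTermBuffer is one of the
new char[] setters):

  // Old pattern: build a replacement token, silently dropping state.
  Token stemBad(Token t, String stem) {
    Token out = new Token(stem, t.startOffset(), t.endOffset());
    // BUG: if t was e.g. a stacked synonym with positionIncrement == 0,
    // out reverts to the default increment of 1.
    return out;
  }

  // With the term-text setters the filter mutates in place, so the
  // increment (and anything else it doesn't know about) rides along.
  Token stemGood(Token t, String stem) {
    t.setTermBuffer(stem.toCharArray(), 0, stem.length());
    return t;
  }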

#2 can now be solved by clone() (another relatively new addition).
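
Roughly: clone() copies fields the filter has never heard of, payloads
included.  Assuming the term setters from the patch, the before/after
looks like:

  // Legacy pattern: copies only the fields the author knew about when
  // the filter was written; the payload never makes it across.
  Token legacy(Token t, String newText) {
    Token out = new Token(newText, t.startOffset(), t.endOffset());
    out.setPositionIncrement(t.getPositionIncrement());
    return out;  // t's payload is gone
  }

  // clone()-based: copy everything, then override just what changed.
  Token cloned(Token t, String newText) {
    Token out = (Token) t.clone();
    out.setTermBuffer(newText.toCharArray(), 0, newText.length());
    return out;  // payload (and any future additions) preserved
  }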

So what new problems might crop up with token reuse?
 - a filter reusing a token, but not zeroing out newer properties such
as payloads, because they didn't exist when the filter was authored
(the opposite of problem #2 above; see the sketch below)
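
Concretely, I'd expect the trap to look something like this
(hypothetical offset setters; scratch/start/end stand for the
tokenizer's internal state):

  // A pre-payload tokenizer refilling a passed-in token: it resets the
  // fields it knew about when it was written, and nothing else.
  Token next(Token result) {
    result.setTermBuffer(scratch, 0, scratchLen);
    result.setStartOffset(start);   // hypothetical offset setters
    result.setEndOffset(end);
    result.setPositionIncrement(1);
    // MISSING: result.setPayload(null); whatever payload the previous
    // term carried silently leaks into this one.
    return result;
  }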

Would a Token.clear() be needed for use by (primarily) tokenizers?
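
I'm imagining something along these lines on Token: reset all the
per-term state but keep the (possibly grown) buffer around (just a
sketch; the exact field set is up for discussion):

  // Hypothetical Token.clear(), called by a tokenizer at the top of
  // next(Token) before refilling the buffer:
  public void clear() {
    payload = null;
    positionIncrement = 1;
    startOffset = endOffset = 0;
    termLength = 0;           // keep termBuffer itself for reuse
    // type = DEFAULT_TYPE;   // presumably reset to "word" as well?
  }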

-Yonik


