lucene-dev mailing list archives

From Koji Sekiguchi <>
Subject Re: Proposal for introducing CharFilter
Date Wed, 19 Nov 2008 02:16:38 GMT
Chris Hostetter wrote:
 > : > If a given Tokenizer does not need to do any character
 > : > normalization (I would think most wouldn't), is there any added
 > : > cost during tokenization with this change?
 > :
 > : Thank you for your reply, Mike!
 > : There is no added cost if Tokenizer doesn't need to call correctOffset.
 >
 > But every tokenizer *should* call correctOffset on the start/end
 > offset of every token it produces, correct?


 > My understanding is that we would make a change like this:
 > 1) change the Tokenizer class to look something like this...


 > 2) change all of the Tokenizers shipped with Lucene to use correctOffset
 > when setting all start/end offsets on any Tokens.
 > ...once those two things are done, anyone using out-of-the-box Tokenizers
 > can use a CharStream and get correct offsets -- anyone with an existing
 > custom Tokenizer should continue to get the same behavior as before, but
 > if they want to start using a CharStream they need to tweak their code.

Looks great!
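(The code Chris quoted at this point did not survive in the archive. A minimal
sketch of the kind of change being described might look like the following;
the names NoOpCharStream and SketchTokenizer are illustrative assumptions,
not the actual patch.)

```java
import java.io.IOException;
import java.io.Reader;

// Sketch only: a stream that can map offsets in filtered text
// back to offsets in the original input.
abstract class CharStream extends Reader {
  public abstract int correctOffset(int currentOff);
}

// Identity wrapper used when no character normalization is applied;
// correctOffset is a no-op that the JIT can inline away.
final class NoOpCharStream extends CharStream {
  private final Reader input;
  NoOpCharStream(Reader input) { this.input = input; }
  @Override public int correctOffset(int currentOff) { return currentOff; }
  @Override public int read(char[] cbuf, int off, int len) throws IOException {
    return input.read(cbuf, off, len);
  }
  @Override public void close() throws IOException { input.close(); }
}

// Tokenizer base class: a plain Reader is wrapped in the identity
// stream, so existing custom tokenizers keep their old behavior.
abstract class SketchTokenizer {
  protected final CharStream input;
  protected SketchTokenizer(Reader in) {
    this.input = (in instanceof CharStream) ? (CharStream) in : new NoOpCharStream(in);
  }
  // Every tokenizer calls this on the start/end offsets it emits.
  protected final int correctOffset(int currentOff) {
    return input.correctOffset(currentOff);
  }
}
```

A tokenizer subclass would then emit correctOffset(start) / correctOffset(end)
instead of raw offsets, which changes nothing unless a correcting CharStream
is actually installed.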

 > The only potential downside I can think of is the performance cost of the
 > added method calls -- but if we make NoOpCharStream.correctOffset final,
 > the JVM should be able to optimize away the "identity" function, correct?

I hadn't considered JVM optimization; however, we already have
the final class "CharReader" in Solr 1.4:

public final class CharReader extends CharStream {

  protected Reader input;

  public CharReader( Reader in ){
    input = in;
  }

  public int correctOffset(int currentOff) {
    return currentOff;
  }
}

and CharReader is instantiated in TokenizerChain.
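(As a self-contained usage sketch of the identity mapping: CharStream is
reduced here to the one method that matters, whereas the real class also
extends Reader, and CharReaderDemo is an illustrative name, not Solr code.)

```java
import java.io.Reader;
import java.io.StringReader;

// Reduced sketch of the CharStream contract.
abstract class CharStream {
  public abstract int correctOffset(int currentOff);
}

// The identity implementation quoted above.
final class CharReader extends CharStream {
  protected Reader input;
  public CharReader(Reader in) { input = in; }
  @Override public int correctOffset(int currentOff) { return currentOff; }
}

public class CharReaderDemo {
  public static void main(String[] args) {
    // A tokenizer chain wraps whatever Reader it was given:
    CharStream cs = new CharReader(new StringReader("some text"));
    // No character filtering, so offsets pass through unchanged.
    System.out.println(cs.correctOffset(42)); // prints 42
  }
}
```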

