lucene-dev mailing list archives

From Chris Hostetter <>
Subject Re: Proposal for introducing CharFilter
Date Tue, 18 Nov 2008 20:37:01 GMT

: > If a given Tokenizer does not need to do any character normalization (I
: would think most wouldn't) is there any added cost during tokenization with
: this change?
: Thank you for your reply, Mike!
: There is no added cost if Tokenizer doesn't need to call correctOffset().

But every tokenizer *should* call correctOffset on the start/end offset of 
every token it produces, correct?

My understanding is that the way we would make a change like this is...

1) change the Tokenizer class to look something like this...

public abstract class Tokenizer extends TokenStream {
  protected CharStream input;
  protected Tokenizer() {}
  protected Tokenizer(Reader input) {
    this(new NoOpCharStream(input));
  }
  protected Tokenizer(CharStream input) {
    this.input = input;
  }
  public void close() throws IOException {
    input.close();
  }
  public void reset(Reader input) throws IOException {
    if (input instanceof CharStream) {
      this.input = (CharStream) input;
    } else {
      this.input = new NoOpCharStream(input);
    }
  }
}
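For reference, NoOpCharStream could be little more than an identity wrapper. This is only a sketch of what I have in mind -- the CharStream shape and the NoOpCharStream name come from this thread, not from any committed API:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical CharStream: a Reader that can map offsets in the
// (possibly filtered) character stream back to offsets in the
// original input.
abstract class CharStream extends Reader {
  public abstract int correctOffset(int currentOff);
}

// Identity wrapper for plain Readers: delegates all reads and
// leaves offsets untouched, so existing tokenizer behavior is
// unchanged.
final class NoOpCharStream extends CharStream {
  private final Reader in;
  NoOpCharStream(Reader in) { this.in = in; }
  public final int correctOffset(int currentOff) { return currentOff; }
  public int read(char[] cbuf, int off, int len) throws IOException {
    return in.read(cbuf, off, len);
  }
  public void close() throws IOException { in.close(); }
}

public class NoOpCharStreamDemo {
  public static void main(String[] args) throws IOException {
    CharStream cs = new NoOpCharStream(new StringReader("hello"));
    char[] buf = new char[5];
    cs.read(buf, 0, 5);
    System.out.println(new String(buf));      // hello
    System.out.println(cs.correctOffset(3));  // 3
    cs.close();
  }
}
```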

2) change all of the Tokenizers shipped with Lucene to use correctOffset 
when setting all start/end offsets on any Tokens.
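Concretely, (2) just means wrapping every raw offset in correctOffset() before it is stored on a Token. A toy illustration (the static correctOffset stub here stands in for a real CharStream whose filter deleted two characters before offset 4 -- purely made-up numbers for the example):

```java
// Sketch of rule (2): pass raw offsets through correctOffset()
// before recording them on a token.
final class OffsetDemo {
  // Stand-in for a CharStream correction: a filter removed 2 chars
  // before offset 4, so later offsets shift right by 2.
  static int correctOffset(int off) { return off < 4 ? off : off + 2; }

  public static void main(String[] args) {
    int rawStart = 5, rawEnd = 9;        // offsets in the filtered text
    int start = correctOffset(rawStart); // offsets in the original text
    int end = correctOffset(rawEnd);
    System.out.println(start + "," + end); // 7,11
  }
}
```

With NoOpCharStream, correctOffset is the identity, so tokenizers that adopt this pattern behave exactly as they do today when fed a plain Reader.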

...once those two things are done, anyone using out-of-the-box tokenizers 
can use a CharStream and get correct offsets -- anyone with an existing 
custom Tokenizer should continue to get the same behavior as before, but 
if they want to start using a CharStream they need to tweak their code.

The only potential downside i can think of is the performance cost of the 
added method calls -- but if we make NoOpCharStream.correctOffset final, 
the JVM should be able to optimize away the "identity" function.

