lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Proposal for introducing CharFilter
Date Tue, 11 Nov 2008 22:46:35 GMT

This looks like a good idea, thanks!

If a given Tokenizer does not need to do any character normalization  
(I would think most wouldn't) is there any added cost during  
tokenization with this change?

Mike

Koji Sekiguchi wrote:

> I'm working on SOLR-822 and trying to introduce new classes
> CharStream, CharReader and CharFilter into Solr:
>
> CharFilter - normalize characters before tokenizer
> https://issues.apache.org/jira/browse/SOLR-822
>
> CharFilter(s) will be placed between Reader and Tokenizer:
>
> // CharReader is needed to convert Reader to CharStream
> TokenStream stream = new MyTokenFilter( new MyTokenizer(
> new MyCharFilter( new CharReader( reader ) ) ) );
>
> and it does character-level filtering, just as TokenFilter does
> Token-level filtering.
>
> I attached a nice JPEG sample for "character normalization" in
> SOLR-822. Please see:
>
> https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG
>
> As you can see, if you use CharFilter, Token offsets could be
> incorrect because CharFilters may convert 1 char to 2 chars or the
> other way around. So, CharFilter has a method "correctOffset()" so
> that a Tokenizer can correct token offsets. (CharStream defines the
> method as abstract and CharFilter extends CharStream. See SOLR-822
> for the details.) But a Tokenizer must be "CharStream aware" to call
> the method. What do folks feel about introducing CharFilter into
> Lucene and changing *all* Tokenizers to "CharStream aware" Tokenizers
> in Lucene 2.9/3.0?
>
> Thank you,
>
> Koji
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
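
To make the offset problem in the quoted proposal concrete, here is a
minimal, self-contained sketch. It is not the SOLR-822 code: the class
name ExpandingCharFilterSketch, the filtered() accessor, and the choice
of expanding U+00DF to "ss" are all assumptions made for illustration.
Only the idea of a correctOffset() method that maps offsets in the
filtered text back to the original input comes from the proposal.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of a character filter that expands one char into two
 * and remembers how to map filtered offsets back to the original input.
 * It does not use (or mirror the signatures of) the SOLR-822 classes.
 */
final class ExpandingCharFilterSketch {
    private final StringBuilder output = new StringBuilder();
    // originalOffsets.get(i) is the input offset of the char that produced output char i.
    private final List<Integer> originalOffsets = new ArrayList<>();
    private final int inputLength;

    ExpandingCharFilterSketch(String input) {
        inputLength = input.length();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\u00DF') {   // sharp s: one input char becomes two output chars
                emit('s', i);
                emit('s', i);
            } else {
                emit(c, i);
            }
        }
    }

    private void emit(char c, int originalOffset) {
        output.append(c);
        originalOffsets.add(originalOffset);
    }

    /** The normalized text that a tokenizer would see. */
    String filtered() {
        return output.toString();
    }

    /** Maps an offset in the filtered text back to an offset in the original input. */
    int correctOffset(int filteredOffset) {
        if (filteredOffset >= originalOffsets.size()) {
            return inputLength;   // an exclusive end offset at the end of the stream
        }
        return originalOffsets.get(filteredOffset);
    }

    public static void main(String[] args) {
        ExpandingCharFilterSketch f = new ExpandingCharFilterSketch("stra\u00DFe x");
        // In the filtered text "strasse x", the first token spans [0, 7).
        // In the original 8-char input, the same word spans [0, 6).
        System.out.println(f.filtered());
        System.out.println(f.correctOffset(0) + "-" + f.correctOffset(7));
    }
}
```

In this sketch, a tokenizer that is not "CharStream aware" would report
the first token at offsets 0-7, which overruns the word in the original
text; one that calls correctOffset() on each boundary reports 0-6.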



