lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From AHMET ARSLAN <iori...@yahoo.com>
Subject Re: Filter before tokenize ?
Date Sat, 12 Sep 2009 19:50:09 GMT
--- On Sat, 9/12/09, Paul Taylor <paul_t100@fastmail.fm> wrote:

> From: Paul Taylor <paul_t100@fastmail.fm>
> Subject: Filter before tokenize ?
> To: java-user@lucene.apache.org
> Date: Saturday, September 12, 2009, 9:39 PM
> Is it possible to filter before
> tokenize, or is that not a good idea.
> I want to convert '&' to 'and' , so they are dealt with
> the same way, but the StandardTokenizer I am using removes
> the &, I could change the tokenizer but  because
> I'm not too clear on jflex syntax it would seem easier to
> just apply a CharFilter before tokenizing, but is that
> possible

May be you can use WhitespaceTokenizer that won't remove &?
Why and's (&) are import for you? Do you need to search them?
Replacing &'s before indexing (by preprocessing) can be a option?


Filter before tokenizer can be simulated by using:

1-)KeywordTokenizer 
2-)Your CharFilter
3-)A token filter that tokenizes input token's text using StandardTokenizer

But i think this is not a good idea.

Hope this helps.


      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message