lucene-java-user mailing list archives

From Paul Taylor
Subject Re: How do you see if a tokenstream has tokens without consuming the tokens ?
Date Wed, 19 Oct 2011 09:26:28 GMT
On 18/10/2011 15:25, Steven A Rowe wrote:
> Hi Paul,
> On 10/18/2011 at 4:57 AM, Paul Taylor wrote:
>> On 18/10/2011 06:19, Steven A Rowe wrote:
>>> Another option is to create a char filter that substitutes
>>> PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods,
>>> etc.,
>> Yes that is how I first did it
> No, I don't think you did.  When I say "char filter" I'm referring to CharFilter
> - this is a different kind of thing from the token filter approach you described taking previously.
If you look at the code you can see I do use a CharFilter:

NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
     specialcharConvertMap.add("!", "Exclamation");

     public  TokenStream tokenStream(String fieldName, Reader reader) {
         CharFilter specialCharFilter = new 

         StandardTokenizer tokenStream = new 
                 tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, specialCharFilter);
                 //TODO **************** set tokenstream back as it was before increment token
         catch(IOException ioe)

         TokenStream result = new LowercaseFilter(result);
         return result;

> If you go with a CharFilter, you can give it access to the entire input at once, and
> use a regular expression (or something like it) to assess the input and then behave accordingly.
> Steve
Well, this is the problem: you can't use a regular expression, and even if
you could, wouldn't that really slow things down, seeing as 99% of inputs
don't need the transformation?
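
For illustration only, here is a minimal sketch of one way to keep the common case cheap without a regular expression. It uses plain java.io rather than the Lucene classes, and it assumes (this is not stated in the thread) that the substitution is only wanted when the text contains no letters or digits at all. The class name ConditionalCharMapping and its methods are hypothetical. The input is read once into a String, scanned in a single pass that stops at the first letter or digit, and handed back as a fresh Reader, so nothing has to be "set back" afterwards.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class ConditionalCharMapping {

    // Assumed mapping, mirroring the "!" -> "Exclamation" entry above.
    static final Map<Character, String> SPECIAL = new HashMap<>();
    static {
        SPECIAL.put('!', "Exclamation");
    }

    // Read the whole input once so it can be inspected and then re-read.
    static String readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
    }

    // Single pass, no regex: returns as soon as one letter or digit is seen.
    static boolean hasLetterOrDigit(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Replace mapped characters; all other characters are copied through.
    static String substitute(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String replacement = SPECIAL.get(c);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Returns a fresh Reader over the original text, or over the
    // substituted text when the input has no letters or digits.
    static Reader wrap(Reader reader) {
        String text = readAll(reader);
        return new StringReader(hasLetterOrDigit(text) ? text : substitute(text));
    }
}
```

The Reader returned by wrap(reader) could then be passed to the tokenizer in place of the original reader. For typical input the only extra work is buffering the text and a scan that usually ends at the first character.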

