lucene-java-user mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: How do you see if a tokenstream has tokens without consuming the tokens ?
Date Tue, 18 Oct 2011 04:19:53 GMT
Hi Paul,

You could add a rule to the StandardTokenizer JFlex grammar to handle this case, bypassing
its other rules.

Another option is to create a char filter that substitutes PUNCT-EXCLAMATION for exclamation
points, PUNCT-PERIOD for periods, etc., but only when the entire input consists exclusively
of whitespace and punctuation.  These symbols would then be left intact by StandardTokenizer.
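
A minimal sketch of the second option, in plain Java with no Lucene dependency. The class name PunctuationOnlySubstituter, its methods, and the PUNCT-QUESTION entry are hypothetical and only for illustration. The sketch treats an input as "punctuation only" when it contains no letter or digit and at least one non-whitespace character, which is an assumption about what should count. It buffers the whole input, which seems acceptable for short fields such as artist names and song titles. A real char filter would also need to correct offsets, which this sketch does not attempt.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Lucene. */
final class PunctuationOnlySubstituter {

    // Illustrative mapping; the replacement names are placeholders.
    private static final Map<Character, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put('!', "PUNCT-EXCLAMATION");
        NAMES.put('.', "PUNCT-PERIOD");
        NAMES.put('?', "PUNCT-QUESTION");
    }

    /** True if the text has no letters or digits and at least one non-whitespace character. */
    static boolean isPunctuationOnly(String text) {
        boolean sawPunctuation = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return false;
            }
            if (!Character.isWhitespace(cp)) {
                sawPunctuation = true;
            }
            i += Character.charCount(cp);
        }
        return sawPunctuation;
    }

    /** Replaces mapped punctuation with names, but only for punctuation-only input. */
    static String substitute(String text) {
        if (!isPunctuationOnly(text)) {
            return text; // normal input passes through untouched
        }
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            String name = NAMES.get(c);
            if (name != null) { // unmapped characters and whitespace are dropped
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(name);
            }
        }
        return out.toString();
    }

    /** Buffers the reader fully, then returns a reader over the (possibly substituted) text. */
    static Reader wrap(Reader in) throws IOException {
        StringBuilder buf = new StringBuilder();
        char[] chunk = new char[256];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.append(chunk, 0, n);
        }
        return new StringReader(substitute(buf.toString()));
    }
}
```

The reader returned by wrap(reader) could then be handed to the tokenizer in place of the original reader, so the common case costs one pass over a short string and no second tokenization.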

Steve

> -----Original Message-----
> From: Paul Taylor [mailto:paul_t100@fastmail.fm]
> Sent: Monday, October 17, 2011 8:13 AM
> To: 'java-user@lucene.apache.org'
> Subject: How do you see if a tokenstream has tokens without consuming the
> tokens ?
> 
> 
> We have a modified version of Lucene's StandardAnalyzer that we use for
> tokenizing music metadata such as artist names & song titles, so
> typically only a few words. Tokenizing usually strips out
> punctuation, which is correct. However, if the input text consists of
> only punctuation characters then we end up with nothing; for these
> particular RARE cases I want to use a mapping filter.
> 
> So what I try to do is have my analyzer tokenize as normal, then, if the
> result is no tokens, retokenize with the mapping filter. I check that it
> has no tokens using incrementToken(), but then I can't see how to
> decrementToken(). How can I do this, or is there a more efficient way of
> doing it? Note that of maybe 10,000,000 records only a few hundred will
> have this problem, so I need a solution which doesn't impact performance
> unreasonably.
> 
>      NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
>      specialcharConvertMap.add("!", "Exclamation");
>      specialcharConvertMap.add("?","QuestionMark");
>      ...............
> 
>      public TokenStream tokenStream(String fieldName, Reader reader) {
>          CharFilter specialCharFilter = new MappingCharFilter(specialcharConvertMap, reader);
> 
>          // First pass: tokenize the raw input.
>          StandardTokenizer tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
>          try
>          {
>              if (!tokenStream.incrementToken())
>              {
>                  // No tokens: retokenize through the mapping filter.
>                  // Note: the first pass has already read from reader.
>                  tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, specialCharFilter);
>              }
>              else
>              {
>                  //TODO **************** set tokenstream back as it was before increment token
>              }
>          }
>          catch(IOException ioe)
>          {
>              // exception is silently ignored here
>          }
>          TokenStream result = new LowerCaseFilter(tokenStream);
>          return result;
>      }
> 
> thanks for any help
> 
> 
> Paul
> 
