lucene-java-user mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: How do you see if a tokenstream has tokens without consuming the tokens ?
Date Wed, 19 Oct 2011 14:17:35 GMT
Hi Paul,

What version of Lucene are you using? The JFlex spec you quote below looks like it predates v3.1.

Steve

> -----Original Message-----
> From: Paul Taylor [mailto:paul_t100@fastmail.fm]
> Sent: Wednesday, October 19, 2011 6:50 AM
> To: Steven A Rowe; java-user@lucene.apache.org
> Subject: Re: How do you see if a tokenstream has tokens without consuming
> the tokens ?
> 
> On 18/10/2011 05:19, Steven A Rowe wrote:
> > Hi Paul,
> >
> > You could add a rule to the StandardTokenizer JFlex grammar to handle
> > this case, bypassing its other rules.
> This seemed to be working. Just to test it out, I changed the EMAIL rule
> to this:
> 
> EMAIL     =  ("!"|"*"|"^"|"!"|"."|"@"|"%"|"♠"|"\"")+
> 
> And changed the order in which the rules were checked:
> 
> %%
>
> {ALPHANUM}       { return ALPHANUM; }
> {APOSTROPHE}     { return APOSTROPHE; }
> {ACRONYM}        { return ACRONYM; }
> {COMPANY}        { return COMPANY; }
> {HOST}           { return HOST; }
> {NUM}            { return NUM; }
> {CJ}             { return CJ; }
> {ACRONYM_DEP}    { return ACRONYM_DEP; }
> {EMAIL}          { return EMAIL; }
>
> /** Ignore the rest */
> . | {WHITESPACE} { /* ignore */ }
> 
> 
> So then if I passed '!!!' to the tokenizer, it kept it, which was exactly
> what I wanted.
>
> However, if I passed it 'fred!!!', it split it into two tokens,
>
> 'fred' and '!!!'
>
> which is not what I wanted. I just wanted to get back
>
> fred
> 
> 
> I tried changing EMAIL to
>
> EMAIL     =  ^("!"|"*"|"^"|"!"|"."|"@"|"%"|"♠"|"\"")+
>
> but use of ^ and $ seems to be disallowed, so I can't see whether there is
> any way to do what I want in the JFlex grammar. If that's the case, can I
> drop the 2nd token somehow in a subsequent filter?
> 
> 
> Paul
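
The behaviour asked for above (keep '!!!' when it is the only token, but drop it from 'fred!!!') depends on the whole token list, so it cannot be decided one token at a time. The following is only a minimal sketch of that decision in plain Java, working on a list of token strings rather than a real TokenStream. The class and method names are hypothetical, and "contains a letter or digit" is an assumed stand-in for "is a normal word token". In Lucene the same logic would have to sit in a filter that buffers the stream's tokens before emitting any of them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: illustrates the pruning rule only.
class PunctuationTokenPruner {

    /** True if the token contains at least one letter or digit. */
    static boolean hasWordChar(String token) {
        return token.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    /**
     * If at least one token has a letter or digit, drop every token that
     * has none. If no token has one, return all tokens unchanged.
     */
    static List<String> prune(List<String> tokens) {
        boolean anyWord = false;
        for (String t : tokens) {
            if (hasWordChar(t)) {
                anyWord = true;
                break;
            }
        }
        if (!anyWord) {
            return new ArrayList<>(tokens); // punctuation-only input: keep it
        }
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (hasWordChar(t)) {
                kept.add(t);
            }
        }
        return kept;
    }
}
```

With this rule the tokens 'fred' and '!!!' reduce to 'fred', while a lone '!!!' is returned as is. The cost is that all tokens of a field are held in memory, which should be acceptable for fields of only a few words.
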
>
> >
> > Another option is to create a char filter that substitutes
> > PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods, etc.,
> > but only when the entire input consists exclusively of whitespace and
> > punctuation.  These symbols would then be left intact by
> > StandardTokenizer.
> >
> > Steve
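
The conditional substitution described above can be sketched on plain strings as follows. This is an illustration under stated assumptions, not a Lucene CharFilter: the class name is hypothetical, "punctuation only" is approximated as "no letters or digits and at least one non-whitespace character", the PUNCT-QUESTION entry is invented for the example, and separating the substituted names with spaces is a design choice of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: string-level sketch of the idea.
class PunctuationSubstituter {

    private static final Map<Character, String> NAMES = new HashMap<>();
    static {
        NAMES.put('!', "PUNCT-EXCLAMATION");
        NAMES.put('.', "PUNCT-PERIOD");
        NAMES.put('?', "PUNCT-QUESTION"); // assumed name, not from the thread
    }

    /** True if the input has no letters or digits but some non-whitespace. */
    static boolean isPunctuationOnly(String input) {
        boolean sawSymbol = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                return false;
            }
            if (!Character.isWhitespace(c)) {
                sawSymbol = true;
            }
        }
        return sawSymbol;
    }

    /** Substitutes named symbols only when the whole input is punctuation. */
    static String substitute(String input) {
        if (!isPunctuationOnly(input)) {
            return input; // normal text: leave it for the tokenizer untouched
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String name = NAMES.get(c);
            if (name != null) {
                out.append(name).append(' ');
            } else {
                out.append(c); // unmapped characters pass through unchanged
            }
        }
        return out.toString().trim();
    }
}
```

Because the check looks at the entire input first, 'fred!!!' passes through unchanged and only inputs such as '!!!' are rewritten.
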
> >
> >> -----Original Message-----
> >> From: Paul Taylor [mailto:paul_t100@fastmail.fm]
> >> Sent: Monday, October 17, 2011 8:13 AM
> >> To: 'java-user@lucene.apache.org'
> >> Subject: How do you see if a tokenstream has tokens without consuming
> >> the tokens ?
> >>
> >>
> >> We have a modified version of a Lucene StandardAnalyzer. We use it for
> >> tokenizing music metadata such as artist names & song titles, so
> >> typically only a few words. When tokenizing, it usually strips out
> >> punctuation, which is correct. However, if the input text consists of
> >> only punctuation characters, then we end up with nothing. For these
> >> particular RARE cases I want to use a mapping filter.
> >>
> >> So what I try to do is have my analyzer tokenize as normal, then if the
> >> result is no tokens, retokenize with the mapping filter. I check that it
> >> has no tokens using incrementToken(), but then I can't see how to
> >> decrementToken(). How can I do this, or is there a more efficient way of
> >> doing this? Note that of maybe 10,000,000 records only a few hundred
> >> will have this problem, so I need a solution which doesn't impact
> >> performance unreasonably.
> >>
> >>       NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
> >>       specialcharConvertMap.add("!", "Exclamation");
> >>       specialcharConvertMap.add("?", "QuestionMark");
> >>       ...............
> >>
> >>       public TokenStream tokenStream(String fieldName, Reader reader) {
> >>           CharFilter specialCharFilter =
> >>               new MappingCharFilter(specialcharConvertMap, reader);
> >>
> >>           StandardTokenizer tokenStream =
> >>               new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
> >>           try
> >>           {
> >>               if (tokenStream.incrementToken() == false)
> >>               {
> >>                   // No tokens: retokenize through the mapping filter
> >>                   tokenStream = new StandardTokenizer(
> >>                       LuceneVersion.LUCENE_VERSION, specialCharFilter);
> >>               }
> >>               else
> >>               {
> >>                   // TODO: set tokenstream back as it was before
> >>                   // incrementToken()
> >>               }
> >>           }
> >>           catch (IOException ioe)
> >>           {
> >>               // exception is ignored here
> >>           }
> >>           TokenStream result = new LowerCaseFilter(tokenStream);
> >>           return result;
> >>       }
> >>
> >> thanks for any help
> >>
> >>
> >> Paul
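
One way to sidestep the missing decrementToken() in the question above is to inspect the characters before any tokenizer is created, instead of inspecting the tokens afterwards. The sketch below is plain Java with hypothetical names. It assumes the field text is short enough to read fully into memory, and that "contains no letter or digit" is a good enough approximation of "the tokenizer would produce no tokens", which may not hold for a customized grammar. It returns a fresh Reader over the same text together with a flag, so the caller can decide whether to wrap that reader in the mapping filter.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Lucene: peeks at characters, not tokens.
class ReaderPeek {

    /** The decision plus a fresh reader positioned at the start of the text. */
    static final class Peeked {
        final boolean useMapping;
        final Reader reader;

        Peeked(boolean useMapping, Reader reader) {
            this.useMapping = useMapping;
            this.reader = reader;
        }
    }

    /** Reads the whole reader into a string (assumes short inputs). */
    static String readAll(Reader reader) {
        try {
            StringWriter out = new StringWriter();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the text contains at least one letter or digit. */
    static boolean hasWordChar(String text) {
        return text.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    /** Buffers the input once and reports whether mapping looks necessary. */
    static Peeked peek(Reader reader) {
        String text = readAll(reader);
        return new Peeked(!hasWordChar(text), new StringReader(text));
    }
}
```

In the analyzer, the returned reader would be passed to the tokenizer directly when useMapping is false, and through the mapping char filter when it is true. The text is scanned once and tokenized once, so the common case pays only for a character scan.
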
