lucene-java-user mailing list archives

From Paul Taylor <paul_t...@fastmail.fm>
Subject Re: How do you see if a tokenstream has tokens without consuming the tokens ?
Date Wed, 19 Oct 2011 10:49:48 GMT
On 18/10/2011 05:19, Steven A Rowe wrote:
> Hi Paul,
>
> You could add a rule to the StandardTokenizer JFlex grammar to handle this case,
> bypassing its other rules.
This seemed to be working. Just to test it out, I changed the EMAIL
definition to this:

EMAIL     =  ("!"|"*"|"^"|"!"|"."|"@"|"%"|"♠"|"\"")+

I also changed the order in which the token rules are checked:

%%

{ALPHANUM}                                                     { return ALPHANUM; }
{APOSTROPHE}                                                   { return APOSTROPHE; }
{ACRONYM}                                                      { return ACRONYM; }
{COMPANY}                                                      { return COMPANY; }
{HOST}                                                         { return HOST; }
{NUM}                                                          { return NUM; }
{CJ}                                                           { return CJ; }
{ACRONYM_DEP}                                                  { return ACRONYM_DEP; }
{EMAIL}                                                        { return EMAIL; }

/** Ignore the rest */
. | {WHITESPACE}                                               { /* ignore */ }


So then if I passed '!!!' to the tokenizer, it kept it, which was exactly
what I wanted.

However, if I passed it 'fred!!!', it split it into two tokens,

'fred' and '!!!'

which is not what I wanted. I just wanted to get back

fred


I tried changing EMAIL to

EMAIL     =  ^("!"|"*"|"^"|"!"|"."|"@"|"%"|"♠"|"\"")+

but use of ^ and $ seems to be disallowed, so I can't see any way to do
what I want in the JFlex grammar. If that's the case, can I drop the
2nd token somehow in a subsequent filter?
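
To make the "drop it in a subsequent filter" idea concrete, here is a rough sketch in plain Java. Lists of strings stand in for a Lucene token stream, and the class and method names (PunctuationTokenDropper, dropPunctuationTokens) are hypothetical, not Lucene API. The rule it assumes: drop punctuation-only tokens whenever at least one other token survives, and keep them when they are all there is.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: models the post-tokenizer step
// on a plain list of token strings.
final class PunctuationTokenDropper {

    // True when the token contains no letter or digit at all.
    static boolean isPunctuationOnly(String token) {
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    // Drops punctuation-only tokens, unless that would leave nothing.
    static List<String> dropPunctuationTokens(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!isPunctuationOnly(token)) {
                kept.add(token);
            }
        }
        // Every token was punctuation: return the input unchanged.
        return kept.isEmpty() ? new ArrayList<>(tokens) : kept;
    }

    public static void main(String[] args) {
        System.out.println(dropPunctuationTokens(Arrays.asList("fred", "!!!")));
        System.out.println(dropPunctuationTokens(Arrays.asList("!!!")));
    }
}
```

With this rule, the tokens 'fred' and '!!!' come back as just 'fred', while a lone '!!!' is left alone. A real filter would presumably have to buffer the whole (short) stream before emitting anything, since it cannot know in advance whether a word token follows.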


Paul






>
> Another option is to create a char filter that substitutes PUNCT-EXCLAMATION for
> exclamation points, PUNCT-PERIOD for periods, etc., but only when the entire input
> consists exclusively of whitespace and punctuation.  These symbols would then be
> left intact by StandardTokenizer.
>
> Steve
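
The substitution rule in that second suggestion could be sketched roughly as below. This is plain Java working on a whole String, not a real Lucene CharFilter (which reads from a Reader and has to track offsets). The class name PunctuationOnlyMapper, the PUNCT-QUESTION symbol, and the choice to separate symbols with spaces are all assumptions made for illustration; only PUNCT-EXCLAMATION and PUNCT-PERIOD come from the suggestion itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not Lucene API: maps punctuation to symbolic names,
// but only when the input has nothing except whitespace and punctuation.
final class PunctuationOnlyMapper {

    private static final Map<Character, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put('!', "PUNCT-EXCLAMATION");
        NAMES.put('.', "PUNCT-PERIOD");
        NAMES.put('?', "PUNCT-QUESTION"); // assumed name, by analogy
    }

    // True when the input has at least one non-whitespace character
    // and no letters or digits.
    static boolean isOnlyWhitespaceAndPunctuation(String input) {
        boolean sawPunctuation = false;
        for (char c : input.toCharArray()) {
            if (Character.isWhitespace(c)) {
                continue;
            }
            if (Character.isLetterOrDigit(c)) {
                return false;
            }
            sawPunctuation = true;
        }
        return sawPunctuation;
    }

    // Returns the input untouched unless it is punctuation-only.
    static String map(String input) {
        if (!isOnlyWhitespaceAndPunctuation(input)) {
            return input;
        }
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            String name = NAMES.get(c);
            if (name != null) {
                out.append(name).append(' '); // space-separated: an assumption
            } else {
                out.append(c); // unmapped characters pass through
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(map("!!!"));
        System.out.println(map("fred!!!"));
    }
}
```

Here 'fred!!!' passes through unchanged, so the tokenizer would still strip the punctuation as usual, and only an input like '!!!' gets rewritten into symbols. The cost is one scan of each (short) input before tokenizing.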
>
>> -----Original Message-----
>> From: Paul Taylor [mailto:paul_t100@fastmail.fm]
>> Sent: Monday, October 17, 2011 8:13 AM
>> To: 'java-user@lucene.apache.org'
>> Subject: How do you see if a tokenstream has tokens without consuming the
>> tokens ?
>>
>>
>> We have a modified version of the Lucene StandardAnalyzer; we use it for
>> tokenizing music metadata such as artist names & song titles, so
>> typically only a few words. When tokenizing, it usually strips out
>> punctuation, which is correct. However, if the input text consists of
>> only punctuation characters then we end up with nothing; for these
>> particular RARE cases I want to use a mapping filter.
>>
>> So what I try to do is have my analyzer tokenize as normal, then, if the
>> result is no tokens, retokenize with the mapping filter. I check that it
>> has no tokens using incrementToken(), but then can't see how I could
>> decrementToken(). How can I do this, or is there a more efficient way of
>> doing this? Note that of maybe 10,000,000 records only a few hundred will
>> have this problem, so I need a solution which doesn't impact performance
>> unreasonably.
>>
>>       NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
>>       specialcharConvertMap.add("!", "Exclamation");
>>       specialcharConvertMap.add("?","QuestionMark");
>>       ...............
>>
>>       public  TokenStream tokenStream(String fieldName, Reader reader) {
>>           CharFilter specialCharFilter = new MappingCharFilter(specialcharConvertMap,reader);
>>
>>           // Tokenize the raw input first (the reader argument was missing here)
>>           StandardTokenizer tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
>>           try
>>           {
>>               if(tokenStream.incrementToken()==false)
>>               {
>>                   tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, specialCharFilter);
>>               }
>>               else
>>               {
>>                   //TODO **************** set tokenstream back as it was before increment token
>>               }
>>           }
>>           catch(IOException ioe)
>>           {
>>
>>           }
>>           // Wrap the chosen tokenizer (originally this wrapped the undeclared "result")
>>           TokenStream result = new LowercaseFilter(tokenStream);
>>           return result;
>>       }
>>
>> thanks for any help
>>
>>
>> Paul
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
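
For the original question quoted above (checking for tokens without losing the first one), the usual trick is to buffer the token that the check consumed and replay it on the next read. Below is a minimal sketch in plain Java; TokenSource is a hypothetical stand-in for a stream whose only operation consumes a token, and none of these names are Lucene API. In Lucene itself something like CachingTokenFilter or captureState()/restoreState() might serve the same purpose, though that is untested here.

```java
import java.util.Arrays;
import java.util.Iterator;

final class LookaheadDemo {

    // Hypothetical stand-in for a consuming token stream:
    // next() returns the next token, or null when exhausted.
    interface TokenSource {
        String next();
    }

    // Wraps a source so that hasTokens() can look ahead without losing a token.
    static final class Lookahead implements TokenSource {
        private final TokenSource delegate;
        private String first;
        private boolean buffered;

        Lookahead(TokenSource delegate) {
            this.delegate = delegate;
        }

        // Reads one token ahead (at most once) and remembers it.
        boolean hasTokens() {
            if (!buffered) {
                first = delegate.next();
                buffered = true;
            }
            return first != null;
        }

        // Replays the remembered token first, then continues with the delegate.
        @Override
        public String next() {
            if (buffered) {
                buffered = false;
                String token = first;
                first = null;
                return token;
            }
            return delegate.next();
        }
    }

    // Builds a source over fixed tokens, for demonstration only.
    static TokenSource of(String... tokens) {
        Iterator<String> it = Arrays.asList(tokens).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        Lookahead stream = new Lookahead(of("fred"));
        System.out.println(stream.hasTokens());
        System.out.println(stream.next());
    }
}
```

The check costs one buffered token per stream, so the common case (input that does produce tokens) pays almost nothing, and only the rare empty case would need to be retokenized with the mapping filter.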



