lucene-java-user mailing list archives

From Paul Taylor <paul_t...@fastmail.fm>
Subject Re: How do you see if a tokenstream has tokens without consuming the tokens ?
Date Thu, 20 Oct 2011 07:59:27 GMT
On 19/10/2011 15:17, Steven A Rowe wrote:
> Hi Paul,
>
> What version of Lucene are you using?  The JFlex spec you quote below looks pre-v3.1?
>
Yes, we copied a version of StandardTokenizer from 2.4 to make some 
changes. We are actually on 3.1 now but haven't spent any time looking 
at the new tokenizer JFlex code, which appears better.

Anyway, I finally have a proof of concept that I think will work this time.

I realised that if someone enters 'fred!!!' I don't want to just 
match 'fred', because then another token will be created for '!!!', so 
I've created separate rules for matching:

fred     (ALPHANUM)
fred!!!  (EMAIL)
!!!      (COMPANY)

I modified the JFlex spec to catch these cases:

// basic word: a sequence of digits & letters (includes Thai to enable
// ThaiAnalyzer to function)
ALPHANUM   = ({LETTER}|{THAI}|[:digit:])+

// CONTROLANDPUNC: control/punctuation chars
CONTROLANDPUNC     =  ("!"|"*"|"^"|"."|"@"|"%"|"♠"|"\"")

COMPANY    =  ({CONTROLANDPUNC})+


// MUST contain both alphanumeric and punctuation characters
EMAIL      =  ({ALPHANUM}|{CONTROLANDPUNC})*{CONTROLANDPUNC}{ALPHANUM}({ALPHANUM}|{CONTROLANDPUNC})*
           |  ({ALPHANUM}|{CONTROLANDPUNC})*{ALPHANUM}{CONTROLANDPUNC}({ALPHANUM}|{CONTROLANDPUNC})*

%%

{EMAIL}       { return EMAIL; }
{ALPHANUM}    { return ALPHANUM; }
{COMPANY}     { return COMPANY; }
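
To make the intent of the three rules concrete, here is a rough plain-Java sketch of how a whole token would be classified. It is not the generated scanner: the class and method names are hypothetical, Character.isLetterOrDigit only approximates {LETTER}|{THAI}|[:digit:], and the punctuation string is assumed to mirror CONTROLANDPUNC.

```java
final class TokenTypeSketch {
    // Assumed to mirror the CONTROLANDPUNC macro ('\u2660' is the spade character).
    private static final String CONTROL_AND_PUNC = "!*^.@%\u2660\"";

    private TokenTypeSketch() {}

    /**
     * Returns "ALPHANUM", "COMPANY" or "EMAIL" for a whole token,
     * or null if the token is empty or contains any other character.
     */
    static String classify(String token) {
        boolean hasAlnum = false;
        boolean hasPunc = false;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                hasAlnum = true;
            } else if (CONTROL_AND_PUNC.indexOf(c) >= 0) {
                hasPunc = true;
            } else {
                return null; // character not covered by any of the three rules
            }
        }
        if (hasAlnum && hasPunc) {
            return "EMAIL";    // mixed token, e.g. fred!!!
        }
        if (hasAlnum) {
            return "ALPHANUM"; // e.g. fred
        }
        return hasPunc ? "COMPANY" : null; // e.g. !!!
    }
}
```

In the real spec the same outcome relies on JFlex preferring the longest match, so 'fred!!!' is consumed whole by the EMAIL rule instead of being split into 'fred' and '!!!'.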

Then I have a filter that looks for type=EMAIL and removes those 
punctuation chars:

public final boolean incrementToken() throws java.io.IOException {
         if (!input.incrementToken()) {
             return false;
         }

         char[] buffer = termAtt.buffer();
         final int bufferLength = termAtt.length();
         final String type = typeAtt.type();

         // remove control chars when they make up only part of the token
         if (EMAIL.equals(type)) {   // equals(), not ==, so the check does not depend on string identity
             int upto = 0;
             for (int i = 0; i < bufferLength; i++) {
                 char c = buffer[i];
                 if (c != '!') {
                     // keep the character; '!' is dropped by not copying it
                     buffer[upto++] = c;
                 }
             }
             }
             termAtt.setLength(upto);
         }
         return true;
     }

I just need to improve the code to use a suitable list of control chars 
rather than hardcoding individual chars.
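
One possible shape for that improvement, as a minimal sketch only: keep the characters in a single constant and compact the buffer against it. The class and method names are hypothetical, and the character list is assumed to match CONTROLANDPUNC in the JFlex spec. It deliberately has no Lucene dependency, so the filter would call it with termAtt.buffer() and termAtt.length().

```java
final class PunctuationStripper {
    // Assumed to mirror the CONTROLANDPUNC macro ('\u2660' is the spade character).
    private static final String CONTROL_AND_PUNC = "!*^.@%\u2660\"";

    private PunctuationStripper() {}

    static boolean isControlOrPunc(char c) {
        return CONTROL_AND_PUNC.indexOf(c) >= 0;
    }

    /**
     * Removes control/punctuation chars from buffer[0..length) in place
     * and returns the new logical length.
     */
    static int strip(char[] buffer, int length) {
        int upto = 0;
        for (int i = 0; i < length; i++) {
            char c = buffer[i];
            if (!isControlOrPunc(c)) {
                buffer[upto++] = c; // keep everything that is not in the list
            }
        }
        return upto;
    }
}
```

Inside incrementToken() the loop would then reduce to termAtt.setLength(PunctuationStripper.strip(buffer, bufferLength)) for tokens of type EMAIL.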

This solution seems the closest fit to Lucene.
