lucene-java-user mailing list archives

From Michael Böckling <Michael.Boeckl...@dmc.de>
Subject Re: Modifying StandardAnalyzer so that it also splits words after punctuation characters that are not followed by whitespace
Date Wed, 30 May 2007 09:57:49 GMT
OK, I followed your advice and commented out some lines in the NUM
section. I just tried it, and it now works as expected and does what I
wanted. Thanks a lot! The grammar looks scary, but it isn't that bad.

Thanks!

Regards,
Michael



> -----Original Message-----
> From: Steven Rowe [mailto:sarowe@syr.edu]
> Sent: Tuesday, 29 May 2007 19:54
> To: java-user@lucene.apache.org
> Subject: Re: Modifying StandardAnalyzer so that it also splits words
> after punctuation characters that are not followed by whitespace
> 
> 
> Hi Michael,
> 
> Michael Böckling wrote:
> > Hi folks!
> > 
> > The topic says it all: I want to modify the StandardAnalyzer so that it
> > also splits words after punctuation characters (.,: etc.) that are NOT
> > followed by a whitespace character, in addition to punctuation
> > characters that ARE followed by whitespace.
> > 
> > Of course I've looked at StandardTokenizer.jj, but I don't quite get it.
> > The recursive nature of the grammar bends my mind.
> > 
> > Can someone smarter than me help here?
> 
> Um, that probably disqualifies me, but anyway...
> 
> There are several regexes in StandardTokenizer.jj that generate tokens
> containing punctuation.  You should be able to selectively comment them
> out to achieve what you want:
> 
> 1. Acronyms:
> 
>   | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
> 
> 2. Company names:
> 
>   | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
> 
> 3. Email addresses:
> 
>   | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
>     (("."|"-") <ALPHANUM>)+ >
> 
> 4. Hostnames:
> 
>   | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
> 
> 5. The <NUM>, <P> and <HAS_DIGIT> regexes, for IP addresses, etc.:
> 
>   | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>          | <HAS_DIGIT> <P> <ALPHANUM>
>          | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>          | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>          | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>          | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>           )
>     >
>   | <#P: ("_"|"-"|"/"|"."|",") >
>   | <#HAS_DIGIT:		  // at least one digit
>     (<LETTER>|<DIGIT>)*
>     <DIGIT>
>     (<LETTER>|<DIGIT>)*
>     >
> 
> 
> Steve
> 
> -- 
> Steve Rowe
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
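To get a feel for which inputs the rules listed above keep together as one
token, here is a minimal sketch in plain Java. It does not use Lucene or
JavaCC. The patterns are only rough java.util.regex approximations of the
ACRONYM, COMPANY, EMAIL and HOST rules (\p{L} stands in for <LETTER>, \p{Nd}
for <DIGIT>, which is an assumption and not the grammar's exact character
ranges), NUM is left out, and the class and method names are made up for
illustration. splitAtPunctuation shows the behavior asked for in this thread,
not what the modified StandardTokenizer literally does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class PunctuationTokenSketch {
    // Approximations of <ALPHA> and <ALPHANUM>; not the grammar's exact ranges.
    private static final String ALPHA = "\\p{L}+";
    private static final String ALPHANUM = "[\\p{L}\\p{Nd}]+";

    // Rule name -> approximate regex, in the order the rules are listed above.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ACRONYM", Pattern.compile(ALPHA + "\\.(" + ALPHA + "\\.)+"));
        RULES.put("COMPANY", Pattern.compile(ALPHA + "[&@]" + ALPHA));
        RULES.put("EMAIL", Pattern.compile(
                ALPHANUM + "([._-]" + ALPHANUM + ")*@" + ALPHANUM + "([.-]" + ALPHANUM + ")+"));
        RULES.put("HOST", Pattern.compile(ALPHANUM + "(\\." + ALPHANUM + ")+"));
    }

    /** Returns the name of the first rule matching the whole input, or null if none does. */
    static String matchingRule(String text) {
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(text).matches()) {
                return rule.getKey();
            }
        }
        return null;
    }

    /** Splits at every run of non-letter, non-digit characters, with or without whitespace. */
    static List<String> splitAtPunctuation(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!part.isEmpty()) {   // a leading delimiter yields an empty first element
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] samples = {"U.S.A.", "AT&T", "user@example.com", "lucene.apache.org"};
        for (String sample : samples) {
            System.out.println(sample + " : rule=" + matchingRule(sample)
                    + ", split=" + splitAtPunctuation(sample));
        }
    }
}
```

Each sample is matched by one of the approximated rules, which is why the
stock tokenizer would emit it as a single token; with the corresponding rule
commented out, the pieces come out separately, much like the split column.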


