lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Better analysis of hyphenated words
Date Thu, 27 Oct 2005 18:13:38 GMT

On 27 Oct 2005, at 12:13, Rob Young wrote:
> I'm using StandardAnalyzer during indexing and I have noticed that  
> it splits hyphenated words in two, ditching the hyphen. This is  
> messing up some of my search results. I would like to keep using  
> StandardAnalyzer because it's very good on the whole, however I  
> would like to add an extra term in these cases. I am fine doing  
> everything except figuring out when StandardTokenizer has split a  
> hyphenated word. All I get is the individual tokens with a type  
> ALPHANUM. Can anyone think of a way I can do this without having to  
> dive into StandardTokenizer?
>
> I have looked at the source for StandardTokenizer and I really  
> really really don't want to have to go there :/

StandardTokenizer is a JavaCC grammar - and it's actually not that  
complex, though JavaCC is a whole other technology to learn if you've  
not done it before.  Look at StandardTokenizer.jj, not .java.

You could pretty easily modify the .jj file to add the hyphen to the
alphanumeric tokens, then rebuild it using JavaCC (the Ant build file for
Lucene can do this for you once you have JavaCC installed).

It won't be possible to achieve what you're after using
StandardTokenizer unmodified - by the time you see its output, the
hyphen is gone and the damage is already done.
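
To make the difference concrete, here is a rough sketch in plain Java. It is
not Lucene code and not the real grammar: the two regexes are simplified
stand-ins I made up for illustration, one for the splitting behaviour
described in the question and one for a hypothetical rule that allows hyphens
inside alphanumeric tokens. The class and field names are invented too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: plain regexes standing in for the grammar, not Lucene code.
public class HyphenTokenSketch {
    // Simplified stand-in for the current behaviour: runs of letters and
    // digits only, so the hyphen acts as a separator and is dropped.
    public static final Pattern SPLIT = Pattern.compile("[A-Za-z0-9]+");

    // Hypothetical modified rule: hyphens are allowed between alphanumeric runs.
    public static final Pattern KEEP =
            Pattern.compile("[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*");

    // Collect every match of the given rule as one token.
    public static List<String> tokenize(Pattern rule, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = rule.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "well-known e-mail";
        System.out.println(tokenize(SPLIT, text)); // [well, known, e, mail]
        System.out.println(tokenize(KEEP, text));  // [well-known, e-mail]
    }
}
```

Note that nothing in the first output tells you "well" and "known" were ever
joined, which is why a filter placed after the tokenizer can't recover the
hyphenated form.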

     Erik



