lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Miller <>
Subject Re: replace values in index
Date Thu, 12 Jul 2007 14:07:17 GMT
While it is possible to alter the StandardAnalyzer, depending on more 
details of your source text, it may be better to use a different 
analyzer or make your own. The StandardAnalyzer is quite slow if you do 
not need all of its features, and modifying it will make it harder to 
keep up with bug fixes or improvements.

That said, StandardAnalyzer does split on commas, so you might want to 
check into whats really going on.

I suspect that 'word1,word2,word3,word4,word5' is being recognized as a 
NUM by StandardAnalzyer. A NUM match will keep a comma deliminated list 
intact as long as every other word contains a digit.

You might alter the <#P regular expression in StandardAnalyzer.jj by 
taking out the ','. This will take out certain matches (like the match 
your getting <g>), but will stop screwing up your matches.
- Mark

Jeff wrote:
> I have documents with lots of text. Part of the text is in the following
> format:
> word1,word2,word3,word4,word5
> I am currently using the StandardAnalyzer and everything is working great
> with the other data, except I can't query for 'word3' as a ',' isn't a 
> token
> seperator. Is there an easy way to add ',' as a token seperator?
> Thanks,
> -Jeff

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message