lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <>
Subject Re: Looking for a way to customize how StandardAnalyzer handles punctuation
Date Thu, 11 Dec 2008 01:52:38 GMT
Let's take a quick step back and see if it helps.  Why do you feel you  
need the StandardAnalyzer to solve your problem?  What else are you  
gaining from it?  Would you be better served by a WhitespaceTokenizer?

That being said, hacking up the grammar isn't as bad as you might  
think.  There are actually two examples of the "grammar" in Lucene,  
one is the StdTokenizer and the other is the WikipediaTokenizer.  They  
are similar, but maybe by looking at two examples it might also help.

On Dec 9, 2008, at 10:14 AM, Greg Shackles wrote:

> Hey everyone,
> I'm running into a problem where some punctuation that I would  
> actually want
> to keep gets thrown out because they don't get tokenized.  By far  
> the most
> common case for this is ampersand, but it does happen with others as  
> well.
> My concern isn't even so much in that I need to be able to enforce  
> that
> punctuation in the search, but more that I need to know it was there  
> when I
> get the results.  I am attaching important word data to the payload  
> of each
> token, so if a "word" was just an ampersand, it disappears.  I took  
> a quick
> look at the StandardAnalyzer classes and it looks like it would be a  
> pain to
> try and modify that directly (I don't have much experience in
> grammar/parsers).  A couple options come to mind, but I wanted to  
> make sure
> there wasn't a better, more elegant solution before I did something  
> that
> felt a little hacky:
> 1) Add a couple fields to the payload saying whether the previous/ 
> next word
> is a single punctuation mark, and which it is.  Then the search can  
> insert
> the punctuation in the results.  The downside to this would be  
> losing the
> metadata that would have gone into the payload for that punctuation  
> mark.
> 2) Do some sort of string replacement logic during indexing and  
> searching to
> change it into something that will get made into a token, but should  
> not
> appear naturally on its own in the text.  I usually shy away from  
> solutions
> like this, but sometimes they prove useful.
> Has anyone done anything like this?  I don't want to lose most of
> StandardAnalyzer's punctuation logic, but mainly I want to tokenize
> punctuation if it appears by itself (surrounded by whitespace).   
> Thanks!
> - Greg

Grant Ingersoll

Lucene Helpful Hints:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message