lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wettin <>
Subject Re: Whitespace/Standard Analyzer and punctuation
Date Wed, 30 Sep 2009 09:27:30 GMT
You could look in to modifying the standard tokenizer lexer code to  
handle punctuation (there is a patch in the isssue tracker for the old  
javacc grammer to handle punctuation) and there is also the Gate NLP  
project which has a fairly nice sentence splitter you might find  
useful. Add a whole bunch of position increment between your sentences  
and limit your searches to how much distance you allow for a hit.

I hope this helps.


30 sep 2009 kl. 05.54 skrev Max Lynch:

> I would like my searches to match "John Smith" when John Smith is in a
> document, but not separated with punctuation.  For example, when I  
> was using
> StandardAnalyzer, "John. Smith" was matching, which is wrong for  
> me.  Right
> now I am using WhitespaceAnalyzer but instead searching for "John  
> Smith"
> "John Smith." "John Smith," etc., which seems like a dumb thing to be
> doing.  Can I separate the punctuation but keep the analyzer aware  
> of where
> the punctuation occurred in my matching term?
> Thanks.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message