lucene-java-user mailing list archives

From "Erick Erickson"
Subject Re: Handling hyphens and other punctuation in proper nouns
Date Thu, 25 May 2006 00:18:08 GMT
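There are several analyzers provided with Lucene that you could check out.
SimpleAnalyzer, WhitespaceAnalyzer and KeywordAnalyzer all come to mind.
WhitespaceAnalyzer splits only on whitespace, so it won't break at a hyphen
or apostrophe, and KeywordAnalyzer keeps the whole field value as a single
token. SimpleAnalyzer, on the other hand, splits at anything that isn't a
letter (and lowercases), so it *will* break at the hyphen.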
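
To make the difference concrete, here is a rough sketch in plain Java of the
two splitting rules. It does not use Lucene at all; the class and method names
are made up for illustration, and the rules only approximate what
WhitespaceAnalyzer and SimpleAnalyzer do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: mimics two tokenization rules, not Lucene's classes.
class SplitRulesSketch {
    // Split on runs of whitespace only (roughly the WhitespaceAnalyzer rule).
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Keep runs of letters and lowercase them (roughly the SimpleAnalyzer rule).
    static List<String> letterTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        String name = "Jean-Luc O'Brian";
        System.out.println(whitespaceTokens(name)); // [Jean-Luc, O'Brian]
        System.out.println(letterTokens(name));     // [jean, luc, o, brian]
    }
}
```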

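NOTE: if you are using QueryParser, pay attention to which analyzer you give
it, since the terms in the query are analyzed too. If it tokenizes differently
from the analyzer used at index time, the query terms may not match what is in
the index.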

I've had fun with PerFieldAnalyzerWrapper, which lets you use different
analyzers for different fields, if that's something you want to do.
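
The idea behind a per-field wrapper can be sketched in plain Java as a lookup
from field name to tokenizing rule, with a default for everything else. This is
not PerFieldAnalyzerWrapper's API; PerFieldSketch, its methods, and the field
names "surname" and "body" are all hypothetical, so check the JavaDoc for the
real constructor and methods in your Lucene version.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the per-field idea; not Lucene code.
class PerFieldSketch {
    private final Function<String, List<String>> fallback;
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();

    PerFieldSketch(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    // Register a tokenizing rule for one field.
    void use(String field, Function<String, List<String>> tokenizer) {
        byField.put(field, tokenizer);
    }

    // Tokenize with the field's rule, or the default if none was registered.
    List<String> tokens(String field, String text) {
        return byField.getOrDefault(field, fallback).apply(text);
    }

    // Whole value as one token (the KeywordAnalyzer idea).
    static List<String> wholeValue(String text) {
        return Collections.singletonList(text);
    }

    // Split on whitespace (the WhitespaceAnalyzer idea).
    static List<String> onWhitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        PerFieldSketch perField = new PerFieldSketch(PerFieldSketch::onWhitespace);
        perField.use("surname", PerFieldSketch::wholeValue);
        System.out.println(perField.tokens("surname", "Smith-Jones Jr"));     // [Smith-Jones Jr]
        System.out.println(perField.tokens("body", "met Smith-Jones today")); // [met, Smith-Jones, today]
    }
}
```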

See Analyzer in the JavaDoc. It lists "all known subclasses", which will lead
you to the ones mentioned above plus quite a few others. PatternAnalyzer
tokenizes with regular expressions. How cool is that?
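
As a taste of what regex-driven tokenizing buys you, here is a small sketch
using only java.util.regex. It is not PatternAnalyzer itself (see its JavaDoc
for how it actually takes its pattern); the class name and the pattern are
assumptions chosen to keep hyphens and apostrophes that sit inside a word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: regex-based tokenizing with java.util.regex, not Lucene.
class RegexTokenSketch {
    // A token is a run of letters/digits, optionally joined to further runs
    // by a single hyphen or apostrophe.
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:['-][\\p{L}\\p{N}]+)*");

    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // [Jean-Luc, O'Brian, of, Smith-Jones, Co]
        System.out.println(tokens("Jean-Luc O'Brian, of Smith-Jones & Co."));
    }
}
```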

You *may* want to write your own analyzer and/or pre-process the text
before indexing and before querying. For instance, should O'Brian
match OBrian? (Notice the apostrophe; it may not be obvious depending on
your font.)
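
If you decide the answer is yes, one simple form of pre-processing is to
reduce each name to a normalized key and run the same step on both the indexed
text and the query terms. A minimal sketch, assuming you want to drop
apostrophes and hyphens and ignore case; NameKeySketch and normalize are
made-up names, not anything from Lucene.

```java
import java.util.Locale;

// Illustrative only: a hypothetical normalization step, not part of Lucene.
class NameKeySketch {
    // Drop straight and curly apostrophes and hyphens, then lowercase.
    static String normalize(String term) {
        return term.replaceAll("['\u2019\\-]", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("O'Brian"));  // obrian
        System.out.println(normalize("OBrian"));   // obrian
        System.out.println(normalize("Jean-Luc")); // jeanluc
    }
}
```

The trade-off is that the original spelling is no longer distinguishable in
that field, so you may want to keep the untouched value in a separate field.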

