lucene-java-user mailing list archives

From Ian Lea
Subject Re: Specialized Analyzer for names
Date Fri, 23 Nov 2012 20:23:01 GMT
I'd use StandardAnalyzer or ClassicAnalyzer.  The choice also depends on
how you want to search.  You probably want a query for "John Smith" to
match "John Smith" and "Smith, John", but maybe not "John Brown and Sam
Smith".  That last case is a problem.  You can partially work around it
by using a BooleanQuery made up of a phrase query and/or a SpanNearQuery
with a small slop and inOrder set to true, plus a general catch-all
clause, with boosts on the first two.
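
A minimal sketch of that query combination, assuming the Lucene 4.x API
that was current when this thread was written (later versions use
BooleanQuery.Builder and BoostQuery instead of setBoost). The class name
NameQuerySketch, the field name, the boost values and the shape of the
catch-all clause (both terms required, in any order) are illustrative
assumptions, not something prescribed above. Terms are expected to be
already in the form the analyzer produces, e.g. lowercased for
StandardAnalyzer.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Hypothetical helper; assumes Lucene 4.x on the classpath.
public class NameQuerySketch {

    // Builds a three-clause query for a two-token name.
    public static Query build(String field, String first, String last) {
        // Clause 1: exact adjacent phrase, e.g. "john smith". Highest boost.
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term(field, first));
        phrase.add(new Term(field, last));
        phrase.setBoost(4.0f); // illustrative value

        // Clause 2: both terms in order with a small slop, so one
        // intervening token (a middle name or nickname) is tolerated.
        SpanQuery[] parts = {
            new SpanTermQuery(new Term(field, first)),
            new SpanTermQuery(new Term(field, last))
        };
        SpanNearQuery near = new SpanNearQuery(parts, 1, true);
        near.setBoost(2.0f); // illustrative value

        // Clause 3: catch-all, both terms anywhere in the field, no boost.
        // This is the clause that still matches "Smith, John".
        BooleanQuery catchAll = new BooleanQuery();
        catchAll.add(new TermQuery(new Term(field, first)), Occur.MUST);
        catchAll.add(new TermQuery(new Term(field, last)), Occur.MUST);

        // At least one SHOULD clause must match; the boosted clauses
        // only raise the score of the closer matches.
        BooleanQuery query = new BooleanQuery();
        query.add(phrase, Occur.SHOULD);
        query.add(near, Occur.SHOULD);
        query.add(catchAll, Occur.SHOULD);
        return query;
    }
}
```

With this shape, "John Brown and Sam Smith" still matches through the
catch-all clause but should score below "John Smith", which is why the
workaround is only partial.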

If this is real-world data, there will always be exceptions and problems.


On Fri, Nov 23, 2012 at 2:36 PM, Carsten Schnober wrote:
> Hi,
> I'm indexing names in a dedicated Lucene field and I wonder which
> analyzer to use for that purpose. Typically, the names are in the format
> "John Smith", so the WhitespaceAnalyzer is likely the best in most
> cases. The field type to choose seems to be the TextField.
> Or, would you rather recommend using the KeywordAnalyzer? I'm a bit
> cautious about that because I'm afraid of wildcard or regex queries such
> as "*Smith" or ".*Smith" respectively.
> However, there might also be special cases and spelling exceptions of
> all kinds, e.g. "Smith, John", "John 'Hammmer' Smith", "Abd al-Aziz",
> "Stan van Hoop" and whatever else one could imagine. Is there a special
> Analyzer that is optimized for dealing with such cases or do I have to do
> normalization beforehand?
> I see that such special characters and spellings can easily be covered
> by the right queries, but that requires the user to know the exact
> spelling, which is what I'm trying to spare her.
> Best regards,
> Carsten
> --
> Institut für Deutsche Sprache |
> Projekt KorAP                 |
> Tel. +49-(0)621-43740789      |
> Korpusanalyseplattform der nächsten Generation
> Next Generation Corpus Analysis Platform