lucene-java-user mailing list archives

From Ian Lea <ian....@gmail.com>
Subject Re: Specialized Analyzer for names
Date Fri, 23 Nov 2012 20:23:01 GMT
I'd use StandardAnalyzer or ClassicAnalyzer.  It also depends on how you
want to search.  You probably want a query for "John Smith" to match
"John Smith" and "Smith, John" but maybe not "John Brown and Sam
Smith".  The latter is a problem.  You can partially work round it by
using a BooleanQuery made up of a phrase query and/or a SpanNearQuery
with a small slop and inOrder set to true, plus a general catch-all
clause, with boosts on the first two.
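
The tiered idea above can be illustrated with a minimal sketch in plain Java. It does not use Lucene's API. It only models the three clauses under simplifying assumptions: tokens are lowercased and split on anything that is not a letter or digit, "slop" is counted as the number of extra tokens inside an in-order match, and the highest matching tier is returned instead of a summed score. The class name, method names and tier weights are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class NameMatchSketch {
    // Hypothetical tier weights, standing in for boosts on the query clauses.
    static final int PHRASE = 3;     // exact phrase match
    static final int NEAR = 2;       // in-order match within a small slop
    static final int CATCH_ALL = 1;  // all terms present somewhere
    static final int NONE = 0;

    /** Lowercases the text and splits it on runs of non-letter, non-digit characters. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** True if the query terms occur in order with at most {@code slop} extra tokens inside the match. */
    static boolean nearInOrder(List<String> doc, List<String> query, int slop) {
        for (int start = 0; start < doc.size(); start++) {
            if (!doc.get(start).equals(query.get(0))) {
                continue;
            }
            int pos = start;
            boolean matched = true;
            for (int k = 1; k < query.size() && matched; k++) {
                int next = -1;
                // Take the earliest later occurrence; this gives the shortest span for this start.
                for (int j = pos + 1; j < doc.size(); j++) {
                    if (doc.get(j).equals(query.get(k))) {
                        next = j;
                        break;
                    }
                }
                if (next < 0) {
                    matched = false;
                } else {
                    pos = next;
                }
            }
            int extraTokens = (pos - start + 1) - query.size();
            if (matched && extraTokens <= slop) {
                return true;
            }
        }
        return false;
    }

    /** Returns the highest tier that the name reaches for the given query. */
    static int score(String query, String name, int slop) {
        List<String> q = tokens(query);
        List<String> d = tokens(name);
        if (q.isEmpty()) {
            return NONE;
        }
        if (Collections.indexOfSubList(d, q) >= 0) {
            return PHRASE;
        }
        if (nearInOrder(d, q, slop)) {
            return NEAR;
        }
        if (d.containsAll(q)) {
            return CATCH_ALL;
        }
        return NONE;
    }

    public static void main(String[] args) {
        String[] names = {
            "John Smith", "John 'Hammer' Smith", "Smith, John", "John Brown and Sam Smith"
        };
        for (String name : names) {
            System.out.println(name + " -> " + score("John Smith", name, 1));
        }
    }
}
```

Under these simplified rules, "Smith, John" and "John Brown and Sam Smith" both reach only the catch-all tier for the query "John Smith", which is why the approach is a partial workaround rather than a complete one.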

If this is real world data there will always be exceptions and problems.


--
Ian.


On Fri, Nov 23, 2012 at 2:36 PM, Carsten Schnober
<schnober@ids-mannheim.de> wrote:
> Hi,
> I'm indexing names in a dedicated Lucene field and I wonder which
> analyzer to use for that purpose. Typically, the names are in the format
> "John Smith", so the WhitespaceAnalyzer is likely the best in most
> cases. The field type to choose seems to be the TextField.
> Or, would you rather recommend using the KeywordAnalyzer? I'm a bit
> cautious about that because I'm afraid of wildcard or regex queries such
> as "*Smith" or ".*Smith" respectively.
>
> However, there might also be special cases and spelling exceptions of
> all kinds, e.g. "Smith, John", "John 'Hammmer' Smith", "Abd al-Aziz",
> "Stan van Hoop" and what else one could imagine. Is there a special
> Analyzer that is optimized on dealing with such cases or do I have to do
> normalization beforehand?
> I see that such special characters and spellings can easily be covered
> by the right queries, but that requires the user to know the exact
> spelling, which is what I'm trying to spare her.
>
> Best regards,
> Carsten
>
> --
> Institut für Deutsche Sprache | http://www.ids-mannheim.de
> Projekt KorAP                 | http://korap.ids-mannheim.de
> Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
> Korpusanalyseplattform der nächsten Generation
> Next Generation Corpus Analysis Platform
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
