lucene-java-user mailing list archives

From Anshum <ansh...@gmail.com>
Subject Re: Creating an index with multiple values for a single field
Date Sat, 08 Jan 2011 03:38:05 GMT
Hi Ryan,
You should try the synonym filter. That should help with this kind of
problem.
You could also look at turning off norms for the name field, or at turning
off tf or idf in scoring.
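
A minimal plain-Java sketch of the idea behind synonym handling, not Lucene's synonym filter itself: every known variant of an employer name is mapped to one canonical term, both when indexing and when querying, so the document carries the term once. The class name AliasSketch, the alias table, and the canonical() helper are all hypothetical and only illustrate the mapping.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class AliasSketch {
    // Hypothetical alias table: every known variant maps to one canonical term.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("wal-mart", "walmart");
        ALIASES.put("wal-mart stores", "walmart");
        ALIASES.put("walmart", "walmart");
    }

    // Normalize a raw name or query to its canonical form.
    // Input with no entry in the table is only trimmed and lowercased.
    static String canonical(String raw) {
        String key = raw.trim().toLowerCase(Locale.ROOT);
        return ALIASES.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        // Apply the same normalization at index time and at query time.
        for (String q : new String[] {"Wal-mart", "Wal-mart Stores", "Walmart"}) {
            System.out.println(q + " -> " + canonical(q));
        }
    }
}
```

Since all three variants collapse to the same term, the employer document needs only one value in the name field, which sidesteps both the inflated term frequency and the per-field idf skew described in the quoted message.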

--
Anshum Gupta
http://ai-cafe.blogspot.com


On Sat, Jan 8, 2011 at 6:03 AM, Ryan Aylward <ryan@glassdoor.com> wrote:

> Our business has a need to allow for multiple values for a single field.
> We have an index of employers, where an employer often has multiple
> names that people use to refer to it. For example, the company "Wal-mart"
> is referred to as:
>
> 1)      Wal-mart
>
> 2)      Wal-mart Stores
>
> 3)      Walmart
> I would like a search for any of these 3 terms to match the Wal-mart
> employer.
>
> I've tried two different approaches for this.
>
> Approach 1: Create multiple values for the same field. So the document has
> these three fields:
>
> 1)      name=Wal-mart
>
> 2)      name=Wal-mart Stores
>
> 3)      name=Walmart
> The problem with this is that Lucene seems to treat the 3 values as one
> long field of "Wal-mart Wal-mart Stores Walmart". This is problematic
> because the term frequency is 2 when a user searches for "Wal-mart".
>
> Approach 2: Create different named fields for each value so the document
> has these 3 fields:
>
> 1)      name1=Wal-mart
>
> 2)      name2=Wal-mart Stores
>
> 3)      name3=Walmart
> This fixes the issue above but introduces a different problem. The idf
> calculation is incorrect because idf is calculated per field. Most
> employers have only one name, or maybe 2. So the name3 field's idf ends up
> being much higher because fewer docs have a given term in the name3 field.
>
> For now, I'm going with approach 2 but overriding IndexReader so that
> IndexReader.docFreq(Term t) always returns the doc frequency from the
> name1 field, even if the Term t is actually for name2 or name3, etc. But
> this doesn't feel like a clean solution.
>
> Any suggestions on how to deal with this? Any ideas would be appreciated.
> Ryan Aylward
>
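
To make the tf and idf concerns in the quoted message concrete, here is a small sketch that assumes the classic formulas used by Lucene's DefaultSimilarity of that era, tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)). The class name ScoringSketch, the corpus size, and the document frequencies are made up for illustration.

```java
class ScoringSketch {
    // Classic tf factor: square root of the term's frequency within the field.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Classic idf factor: rarer terms (lower docFreq) get a higher weight.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 100_000; // hypothetical number of employer documents

        // Approach 1: values concatenated into one field, so "Wal-mart" occurs twice.
        System.out.println("tf(1) = " + tf(1) + ", tf(2) = " + tf(2));

        // Approach 2: the same term is far rarer in name3 than in name1,
        // because few employers have a third name at all.
        int docFreqName1 = 500; // hypothetical
        int docFreqName3 = 5;   // hypothetical
        System.out.println("idf(name1) = " + idf(docFreqName1, numDocs));
        System.out.println("idf(name3) = " + idf(docFreqName3, numDocs));
    }
}
```

With any such numbers, the name3 idf comes out larger than the name1 idf for the same term, which is the skew that the docFreq override in the quoted message works around.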
