lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ryan Aylward <r...@glassdoor.com>
Subject Creating an index with multiple values for a single field
Date Sat, 08 Jan 2011 00:33:22 GMT
Our business has a need to allow for multiple values for a single field. For example, we have
an index of employers where an employer often has multiple ways people refer to it. For example,
the company "Wal-mart" is referred to as:

1)      Wal-mart

2)      Wal-mart Stores

3)      Walmart
I would like a search for any of these 3 terms to match the Wal-mart employer.

I've tried two different approaches for this.

Approach 1: Create multiple values for the same field. So the document has these three fields:

1)      name=Wal-mart

2)      name=Wal-mart Stores

3)      name=Walmart
The problem with this is Lucene seems to treat the 3 different fields as one long field of
"Wal-mart Wal-mart Stores Walmart". This is problematic b/c term frequencies is 2 when a user
searches for "Wal-mart".

Approach 2: Create different named fields for each value so the document has these 3 fields:

1)      name1=Wal-mart

2)      name2=Wal-mart Stores

3)      name3=Walmart
This fixes the issue above but introduces a different problem. The idf calculation is incorrect
b/c idf is calculated per field. Most employers only have one name or maybe 2 names. So the
name3 fields idf ends up being much higher b/c there are fewer docs with a given term in the
name3 field.

For now, I'm going with approach 2 but overriding the IndexReader. IndexReader.docFreq(Term
t) method always returns the doc frequency from the name1 field even if the Term t is actually
for name2 or name3, etc. But this doesn't feel like a clean solution.

Any suggestions on how to deal with this? Any ideas would be appreciated.
Ryan Aylward

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message