Date: Tue, 11 Jan 2011 00:16:22 +0200
Subject: Re: Creating an index with multiple values for a single field
From: Doron Cohen
To: java-user@lucene.apache.org

On Mon, Jan 10, 2011 at 7:44 PM, Ryan Aylward wrote:

> We do leverage synonyms but they are not appropriate for this case. We use
> synonyms for words that are truly synonymous for the entire index such as
> "inc" and "incorporated". Those words are always interchangeable. However,
> many of the employer alternate names are only valid for a single employer
> not for the entire index.
> We do disable the lengthNorm but we benefit from tf and idf so disabling
> those would cause more harm than good.
> Any other suggestions would be appreciated.
>

How about indexing this specific field without analysis, except perhaps for
lowercasing? In the above example the field would have exactly 3 tokens:
[wal-mart], [wal-mart stores], [walmart].

At search time this field would be treated the same way, that is, no analysis
except for lowercasing. Since norms are already omitted for this field, length
differences between docs would not affect scores.

HTH,
Doron

> Thanks.
>
> -----Original Message-----
> From: Anshum [mailto:anshumg@gmail.com]
> Sent: Friday, January 07, 2011 7:38 PM
> To: java-user@lucene.apache.org
> Subject: Re: Creating an index with multiple values for a single field
>
> Hi Ryan,
> You should try the synonym filter. That should help you with this kinda
> problem.
> You could also look at turning off norms for the name field, or turning off
> tf or idf.
>
> --
> Anshum Gupta
> http://ai-cafe.blogspot.com
>
>
> On Sat, Jan 8, 2011 at 6:03 AM, Ryan Aylward wrote:
>
> > Our business has a need to allow for multiple values for a single field.
> > For example, we have an index of employers where an employer often has
> > multiple ways people refer to it. For example, the company "Wal-mart" is
> > referred to as:
> >
> > 1) Wal-mart
> >
> > 2) Wal-mart Stores
> >
> > 3) Walmart
> > I would like a search for any of these 3 terms to match the Wal-mart
> > employer.
> >
> > I've tried two different approaches for this.
> >
> > Approach 1: Create multiple values for the same field. So the document has
> > these three fields:
> >
> > 1) name=Wal-mart
> >
> > 2) name=Wal-mart Stores
> >
> > 3) name=Walmart
> > The problem with this is Lucene seems to treat the 3 different fields as
> > one long field of "Wal-mart Wal-mart Stores Walmart". This is problematic
> > b/c term frequencies is 2 when a user searches for "Wal-mart".
> >
> > Approach 2: Create different named fields for each value so the document
> > has these 3 fields:
> >
> > 1) name1=Wal-mart
> >
> > 2) name2=Wal-mart Stores
> >
> > 3) name3=Walmart
> > This fixes the issue above but introduces a different problem. The idf
> > calculation is incorrect b/c idf is calculated per field. Most employers
> > only have one name or maybe 2 names. So the name3 fields idf ends up being
> > much higher b/c there are fewer docs with a given term in the name3 field.
> >
> > For now, I'm going with approach 2 but overriding the IndexReader.
> > IndexReader.docFreq(Term t) method always returns the doc frequency from the
> > name1 field even if the Term t is actually for name2 or name3, etc. But this
> > doesn't feel like a clean solution.
> >
> > Any suggestions on how to deal with this? Any ideas would be appreciated.
> > Ryan Aylward
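
A minimal sketch of the suggestion in the reply above (treat each alternate
name as one lowercased token), written in plain Java with no Lucene
dependency. The class and method names (AliasTokens, analyzedTokens,
wholeValueTokens, termFrequencies, matches) are made up for illustration and
are not Lucene APIs, and the whitespace split is only a crude stand-in for a
real analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class AliasTokens {

    // Toy stand-in for an analyzed multi-valued field: lowercase each value,
    // split it on whitespace, and concatenate the tokens of all values.
    public static List<String> analyzedTokens(List<String> values) {
        List<String> tokens = new ArrayList<>();
        for (String value : values) {
            String[] words = value.toLowerCase(Locale.ROOT).trim().split("\\s+");
            tokens.addAll(Arrays.asList(words));
        }
        return tokens;
    }

    // Toy stand-in for the unanalyzed variant: every value becomes exactly
    // one lowercased token, so "Wal-mart Stores" stays a single term.
    public static List<String> wholeValueTokens(List<String> values) {
        List<String> tokens = new ArrayList<>();
        for (String value : values) {
            tokens.add(value.toLowerCase(Locale.ROOT).trim());
        }
        return tokens;
    }

    // Count how often each token occurs in the field of one document.
    public static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Search side: the query gets the same treatment (lowercasing only) and
    // must equal one of the stored whole-value tokens.
    public static boolean matches(List<String> values, String query) {
        return wholeValueTokens(values).contains(query.toLowerCase(Locale.ROOT).trim());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Wal-mart", "Wal-mart Stores", "Walmart");
        System.out.println("analyzed:    " + termFrequencies(analyzedTokens(names)));
        System.out.println("whole-value: " + termFrequencies(wholeValueTokens(names)));
        System.out.println("query 'WALMART' matches: " + matches(names, "WALMART"));
    }
}
```

In the analyzed variant "wal-mart" has a term frequency of 2 in the combined
field, which is the problem described for approach 1. With whole-value tokens
each of the three names occurs once, and since all of them live in one field,
document frequencies are counted over that single field instead of being
split across name1/name2/name3. The trade-off is that a query only matches
when it equals a full alternate name.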