From: Steven A Rowe
To: "java-user@lucene.apache.org"
Date: Tue, 23 Mar 2010 18:05:53 -0400
Subject: RE: Lucene query with long strings

Hi Aaron,

Your "false positives" comments point to a mismatch between what you're currently asking Lucene for (any document matching any one of the terms in the query) and what you want (only fully "correct" matches).

You need to identify the terms of the query that MUST match and tell Lucene about them (QueryParser understands the "+" syntax to mean a required term).

If your queries come from sources that don't reliably match the indexed values, you may need to use synonyms to map between e.g. "California" and "CA", and then require that at least one of the synonyms matches (e.g. "+(California CA)").

Steve

On 03/23/2010 at 5:08 PM, Aaron Schon wrote:
> hi all, I have been playing with Lucene for a while now, but stuck on a
> perplexing issue.
>
> I have an index, with a field "Affiliation", some example values are:
>
> - "Stanford University School of Medicine, Palo Alto, CA USA",
> - "Institute of Neurobiology, School of Medicine, Stanford University,
>   Palo Alto, CA",
> - "School of Medicine, Harvard University, Boston MA",
> - "Brigham & Women's, Harvard University School of Medicine, Boston, MA"
> - "Harvard University, Cambridge MA"
>
> and so on...
> (the bottom-line being the affiliations are written in
> multiple ways with no apparent consistency)
>
> I query the index on the affiliation field using say "School of
> Medicine, Stanford University, Palo Alto, CA" (with QueryParser) to
> find all Stanford related documents, I get a lot of false +ves,
> presumably because of the presence of School of Medicine etc. etc.
> (note: I cannot use Phrase query because of variability in the way
> affiliation is constructed)
>
> I have tried the following:
>
> 1. Use a SpanNearQuery by splitting the search phrase with a whitespace
>    (here I get no results!)
> 2. Tried boosting (using ^) by splitting with the comma and boosting
>    the last parts such as "Palo Alto CA" with a much higher boost than the
>    initial phrases. Here I still get lots of false +ves.
>
> Any suggestions on how to approach this? Is SpanNear the way to go? Any
> other ideas on why I get 0 results?
>
> Thanks in advance for helping a newbie.
>
> AS
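
For illustration, the required-term and synonym-group idea from the reply above ("+" for terms that must match, "+(California CA)" for a set of alternatives) could be sketched roughly as follows. This is only a sketch under stated assumptions: it builds the query string in plain Java and does not call Lucene itself, and the names RequiredTermsQuery, build and quoteIfPhrase are hypothetical helpers, not part of any Lucene API. The resulting string would then be handed to QueryParser as usual.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Lucene class): turns groups of alternative
// terms into QueryParser syntax where every group is required ("+").
public class RequiredTermsQuery {

    // Wrap multi-word alternatives in double quotes so they are parsed as a phrase.
    // Assumption: terms contain no query-syntax special characters; real input
    // would likely need escaping first (e.g. with QueryParser.escape).
    static String quoteIfPhrase(String term) {
        return term.indexOf(' ') >= 0 ? "\"" + term + "\"" : term;
    }

    // Each inner list is one group of synonyms; at least one member of every
    // group must match. A single-member group becomes "+term", a larger group
    // becomes "+(a b c)".
    static String build(List<List<String>> requiredGroups) {
        List<String> clauses = new ArrayList<>();
        for (List<String> group : requiredGroups) {
            if (group.isEmpty()) {
                continue; // nothing to require for an empty group
            }
            List<String> alternatives = new ArrayList<>();
            for (String alt : group) {
                alternatives.add(quoteIfPhrase(alt));
            }
            if (alternatives.size() == 1) {
                clauses.add("+" + alternatives.get(0));
            } else {
                clauses.add("+(" + String.join(" ", alternatives) + ")");
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        String query = build(Arrays.asList(
                Arrays.asList("Stanford"),
                Arrays.asList("California", "CA")));
        System.out.println(query); // prints: +Stanford +(California CA)
    }
}
```

Which terms go into the required groups (here "Stanford" and the state) is the judgment call the reply asks for; optional terms such as "School of Medicine" could simply be appended without a "+" so they influence ranking without being mandatory.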