From: "Jack Krupansky"
Reply-To: java-user@lucene.apache.org
Subject: Re: Questions about FuzzyQuery in Lucene 4.x
Date: Tue, 29 Jan 2013 14:50:23 -0500

I'm sorry, but for anybody to help you here, you really need to be able to
provide a concise test case, like 10-20 lines of code, completely
self-contained. If you think you need a million documents to repro what you
claimed was a simple scenario, then you leave me very, very confused - and
unable to help you any further.

-- Jack Krupansky

-----Original Message-----
From: George Kelvin
Sent: Tuesday, January 29, 2013 2:43 PM
To: java-user@lucene.apache.org
Subject: Re: Questions about FuzzyQuery in Lucene 4.x

Hi Jack,

The problematic query is "scar"+"wads". There are several (more than 10)
documents in the data with the content "star wars", so I think that query
should be able to find all these documents.

I was trying to provide a minimal test case, but I couldn't reduce the size
of the data showing the failure. The smallest data set showing the failure
that I have so far is around 2 million documents.

However, I found a suspicious document with the content "scor". If I remove
it from the 2 million documents, that query can find all the "star wars"
documents. If I add it back, then the query can't find any.

I tried to reduce the size of the data further to 1 million documents and
add that "scor" document, but then the query can still find all the
"star wars" documents.

Is it possible that Lucene somehow fails to find all the valid terms within
the edit distance?

Thanks!

George

On Tue, Jan 29, 2013 at 10:02 AM, Jack Krupansky wrote:

> I also noticed that you have "MUST" for your full string of fuzzy terms -
> that means every one of them must appear in an indexed document to be
> matched. Is it possible that maybe even one term was not in the same
> indexed document?
>
> Try to provide a complete example that shows the input data and the query
> - all the literals. In other words, construct a minimal test case that
> shows the failure.
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: George Kelvin
> Sent: Tuesday, January 29, 2013 12:28 PM
>
> To: java-user@lucene.apache.org
> Subject: Re: Questions about FuzzyQuery in Lucene 4.x
>
> Hi Jack,
>
> ed is set to 1 here and I have lowercased all the data and queries.
>
> Regarding the indexed data factor you mentioned, can you elaborate more?
>
> Thanks!
>
> George
>
>
> On Tue, Jan 29, 2013 at 9:10 AM, Jack Krupansky wrote:
>
>> That depends on the value of "ed", and the indexed data.
>>
>> Another factor to take into consideration is that a case change ("Star"
>> vs. "star") also counts as an edit.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: George Kelvin
>> Sent: Tuesday, January 29, 2013 11:49 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: Questions about FuzzyQuery in Lucene 4.x
>>
>>
>> Hi Jack,
>>
>> Thanks for your reply!
>>
>> I don't think I passed the prefixLength parameter in.
>>
>> Here is the code I used to build the FuzzyQuery:
>>
>> String[] words = str.split("\\+");
>> BooleanQuery query = new BooleanQuery();
>>
>> for (int i = 0; i < words.length; i++)
>> {
>>     Term t = new Term(field, words[i]);
>>     FuzzyQuery fq = new FuzzyQuery(t, ed);
>>     query.add(fq, BooleanClause.Occur.MUST);
>> }
>>
>> int k = 10;
>> TopDocs results = searcher.search(query, k);
>>
>> Does it look right to you?
>>
>> Thanks!
>>
>> George
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
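
The edit-distance reasoning in this thread can be checked by hand with a small sketch. The class below is a hypothetical helper, not part of Lucene: it computes plain Levenshtein distance (single-character inserts, deletes, and substitutions) for the literals mentioned above. It assumes nothing about how Lucene enumerates or ranks candidate terms, and Lucene's own matching may differ in details such as transposition handling.

```java
// Hypothetical helper for illustration only; this is not Lucene code.
// It computes plain Levenshtein distance between two strings.
final class EditDistanceSketch {

    /** Number of single-character inserts, deletes, or substitutions to turn a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j chars of b takes j inserts.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i chars of a into the empty string takes i deletes.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("scar", "star")); // 1
        System.out.println(distance("wads", "wars")); // 1
        System.out.println(distance("scor", "scar")); // 1
        System.out.println(distance("scor", "star")); // 2
        System.out.println(distance("Star", "star")); // 1, a case change counts as an edit
    }
}
```

With ed set to 1, both query terms are within range of "star" and "wars", and "scor" is also within range of the query term "scar" while being two edits away from "star". The sketch only confirms those distances; it does not explain why adding the "scor" document changes the results.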