MIME-Version: 1.0
In-Reply-To: <21e7543b.399a.1399faedf4b.Coremail.qibaoyuan@126.com>
References: <21e7543b.399a.1399faedf4b.Coremail.qibaoyuan@126.com>
Date: Fri, 7 Sep 2012 10:12:11 +0200
Subject: Re: Finding the most matching (cf. similar) document to another one
From: Jochen Hebbrecht
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=20cf307ca3de35ae9604c9182ad4

--20cf307ca3de35ae9604c9182ad4
Content-Type: text/plain; charset=ISO-8859-1

Hi qibaoyuan,

I tried your second solution, using the scoring data. I think this way I
can still use MoreLikeThis: all documents with a score > X are a possible
match :-).

Thanks!
Jochen

2012/9/7 qibaoyuan

> Maybe you could alter MLT to make it work with the AND operator. But I
> don't think there is anything wrong with using the OR operator. Lucene will
> rank all the docs according to the underlying similarity algorithm (VSM,
> BM25, etc.). In your case, Doc 2 will be ranked first because it matches
> the most words in Doc 4. Other docs containing some of the words in Doc 4
> may be listed too, but will get a lower score.
>
> At 2012-09-07 15:32:23, "Jochen Hebbrecht" wrote:
> >Hi,
> >
> >Imagine you are indexing the following documents (every line is stored in
> >one single field, analyzed with the default StandardAnalyzer):
> >- Doc 1: restaurant 't Robbeke fish passoa beer 15 EUR 5 EUR 2 EUR total 22 EUR
> >- Doc 2: restaurant De Genieter scampi's fish sticks cola fanta 18 EUR 15 EUR 2 EUR 2 EUR total 37 EUR
> >- Doc 3: restaurant 't Stoveke frites meat beer 10 EUR 5 EUR total 15 EUR
> >
> >Now, I have another document with the following field:
> >- Doc 4: restaurant De Genieter VAT 37 EUR
> >
> >I'm wondering if Lucene has a feature to find the "most-matching" document.
> >In my example, the "most-matching" document for Doc 4 is Doc 2.
> >I've played around with "MoreLikeThis", but it seems to create a query
> >with an OR operator between the terms, something like
> >"restaurant OR de OR genieter OR VAT OR 37 OR EUR".
> >Ideally Lucene would match on "restaurant" AND "de" AND "genieter" AND "37"
> >AND "EUR". Well, it shouldn't really be AND'ing all terms, because I'm
> >looking for the best match, and some terms may have to be dropped from
> >the list to get to the best match.
> >
> >Maybe it can generate a kind of percentage/score to tell me which
> >document is the closest to Doc 4? Does Lucene have this kind of feature?
> >
> >Thanks in advance for any answer,
> >Jochen

--20cf307ca3de35ae9604c9182ad4--
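
For readers of the archive, here is a minimal sketch of the idea discussed in this thread: score every document against an OR of the query terms, rank by score, and keep those above a threshold X. It does not use Lucene at all. The class BestMatchSketch, its methods, the crude tokenizer and the simplified "IDF times coordination factor" formula are assumptions made for illustration only; they are loosely modelled on classic TF-IDF scoring and are not what MoreLikeThis or Lucene's Similarity classes actually compute.

```java
import java.util.*;

class BestMatchSketch {

    // Crude stand-in for an analyzer: lowercase, split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // OR-style scoring: sum a simplified IDF over the distinct query terms found in a
    // document, then scale by the fraction of query terms matched (a "coord" factor).
    static Map<String, Double> score(Map<String, String> docs, String query) {
        Map<String, Set<String>> docTerms = new LinkedHashMap<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            Set<String> terms = new HashSet<>(tokenize(e.getValue()));
            docTerms.put(e.getKey(), terms);
            for (String t : terms) docFreq.merge(t, 1, Integer::sum);
        }
        Set<String> queryTerms = new LinkedHashSet<>(tokenize(query));
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : docTerms.entrySet()) {
            double sum = 0.0;
            int matched = 0;
            for (String t : queryTerms) {
                if (e.getValue().contains(t)) {
                    // Rare terms weigh more than terms present in every document.
                    sum += 1.0 + Math.log((double) docs.size() / (docFreq.get(t) + 1));
                    matched++;
                }
            }
            double coord = queryTerms.isEmpty() ? 0.0 : (double) matched / queryTerms.size();
            scores.put(e.getKey(), sum * coord);
        }
        return scores;
    }

    // Id of the highest-scoring document, or null if there are no documents.
    static String bestMatch(Map<String, Double> scores) {
        String best = null;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (best == null || e.getValue() > scores.get(best)) best = e.getKey();
        }
        return best;
    }

    // All document ids whose score is strictly greater than the threshold X.
    static List<String> aboveThreshold(Map<String, Double> scores, double x) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > x) ids.add(e.getKey());
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("Doc 1", "restaurant 't Robbeke fish passoa beer 15 EUR 5 EUR 2 EUR total 22 EUR");
        docs.put("Doc 2", "restaurant De Genieter scampi's fish sticks cola fanta 18 EUR 15 EUR 2 EUR 2 EUR total 37 EUR");
        docs.put("Doc 3", "restaurant 't Stoveke frites meat beer 10 EUR 5 EUR total 15 EUR");
        Map<String, Double> scores = score(docs, "restaurant De Genieter VAT 37 EUR");
        System.out.println(scores);
        System.out.println("Best match: " + bestMatch(scores));
    }
}
```

With the four documents from the original question, Doc 2 comes out on top because it shares the rare terms "de", "genieter" and "37" with Doc 4, while Doc 1 and Doc 3 only share "restaurant" and "EUR". The threshold X passed to aboveThreshold is something you would have to tune yourself; raw scores of this kind are generally not comparable between different queries, so a fixed X is best treated as a heuristic.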