Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: domain of jochenhebbrecht@gmail.com
 designates 209.85.212.48 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <21e7543b.399a.1399faedf4b.Coremail.qibaoyuan@126.com>
References: 
 <CAJcVXktKq1=pK2UmuG_3vEdaaAWmLRAxSgtcV3is47MBZgWVww@mail.gmail.com>
	<21e7543b.399a.1399faedf4b.Coremail.qibaoyuan@126.com>
Date: Fri, 7 Sep 2012 10:12:11 +0200
Message-ID: 
 <CAJcVXkvtMxoAaexPSOvELXkFPqHd02dC2Ut0yTN2zs94eFoKHA@mail.gmail.com>
Subject: Re: Finding the most matching (cf. similar) document to another one
From: Jochen Hebbrecht <jochenhebbrecht@gmail.com>
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=20cf307ca3de35ae9604c9182ad4

--20cf307ca3de35ae9604c9182ad4
Content-Type: text/plain; charset=ISO-8859-1

Hi qibaoyuan,

I tried your second suggestion, relying on the scoring data. This way I can
keep using MoreLikeThis: every document with a score > X is a possible
match :-).
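
A toy sketch of the "score > X" idea, in plain Java and without Lucene. It is
not Lucene's scoring formula and not the MoreLikeThis API: the class and method
names (OrMatchSketch, terms, coverage, candidates) are hypothetical, the
tokenizer is only a rough stand-in for StandardAnalyzer, and the score is
simply the fraction of Doc 4's distinct terms that a document contains. It only
shows why OR-style matching plus a cutoff can still single out Doc 2.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class OrMatchSketch {

    // Doc 4 from the original question, used as the "query" document.
    static final String DOC4 = "restaurant De Genieter VAT 37 EUR";

    // Docs 1-3 from the original question, in insertion order.
    static Map<String, String> sampleDocs() {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("Doc 1", "restaurant 't Robbeke fish passoa beer 15 EUR 5 EUR 2 EUR total 22 EUR");
        docs.put("Doc 2", "restaurant De Genieter scampi's fish sticks cola fanta 18 EUR 15 EUR 2 EUR 2 EUR total 37 EUR");
        docs.put("Doc 3", "restaurant 't Stoveke frites meat beer 10 EUR 5 EUR total 15 EUR");
        return docs;
    }

    // Assumption: lowercase and split on anything that is not a letter or digit.
    // This only approximates what a real analyzer would do.
    static Set<String> terms(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Fraction (0.0 to 1.0) of the distinct query terms that also occur in the document.
    static double coverage(String query, String doc) {
        Set<String> queryTerms = terms(query);
        if (queryTerms.isEmpty()) {
            return 0.0;
        }
        Set<String> docTerms = terms(doc);
        int matched = 0;
        for (String t : queryTerms) {
            if (docTerms.contains(t)) {
                matched++;
            }
        }
        return (double) matched / queryTerms.size();
    }

    // Keep only the documents whose score is strictly above the threshold X.
    static Map<String, Double> candidates(String query, Map<String, String> docs, double threshold) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            double score = coverage(query, e.getValue());
            if (score > threshold) {
                out.put(e.getKey(), score);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // X = 0.5 is an arbitrary cutoff chosen for this example.
        System.out.println(candidates(DOC4, sampleDocs(), 0.5));
    }
}
```

With these sample documents, Doc 2 contains 5 of the 6 distinct terms of Doc 4
(everything except "VAT"), while Doc 1 and Doc 3 only share "restaurant" and
"EUR", so a cutoff of 0.5 leaves Doc 2 as the only candidate. Real Lucene
scores are not normalized like this, so a usable X has to be found by
experiment.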

Thanks!
Jochen


2012/9/7 qibaoyuan <qibaoyuan@126.com>

>
> Maybe you could alter MLT to make it work with the AND operator. But I
> don't think there is anything wrong with using the OR operator. Lucene
> will rank all the docs according to the underlying similarity algorithm
> (VSM, BM25, etc.). In your case, Doc 2 will be ranked first because it
> matches the most words in Doc 4. Other docs containing only some of the
> words in Doc 4 may be listed too, but will get a lower score.
>
>
> At 2012-09-07 15:32:23,"Jochen Hebbrecht" <jochenhebbrecht@gmail.com>
> wrote:
> >Hi,
> >
> >Imagine you are indexing the following documents (every line is stored in
> >1 single field, analyzed with the default StandardAnalyzer):
> >- Doc 1: restaurant 't Robbeke fish passoa beer 15 EUR 5 EUR 2 EUR total
> >22 EUR
> >- Doc 2: restaurant De Genieter scampi's fish sticks cola fanta 18 EUR 15
> >EUR 2 EUR 2 EUR total 37 EUR
> >- Doc 3: restaurant 't Stoveke frites meat beer 10 EUR 5 EUR total 15 EUR
> >
> >Now, I have another document with the following field:
> >- Doc 4: restaurant De Genieter VAT 37 EUR
> >
> >I'm wondering if Lucene has a feature to find the "most-matching"
> >document. In my example, the "most-matching" document for Doc 4 is
> >Doc 2.
> >I've played around with "MoreLikeThis", but it seems to create a
> >query with an OR operator between the terms, something like
> >"restaurant OR de OR genieter OR VAT OR 37 OR EUR".
> >Ideally Lucene would match on "restaurant" AND "de" AND "genieter" AND "37"
> >AND "EUR". It shouldn't strictly AND all terms, though, because I'm
> >looking for the best match, and some terms may have to be dropped
> >from the list to get to that best match.
> >
> >Maybe it can generate a kind of percentage/scoring to tell me which
> >document is the closest to Doc 4? Does Lucene have this kind of feature?
> >
> >Thanks in advance for any answer,
> >Jochen
>

--20cf307ca3de35ae9604c9182ad4--