lucene-java-user mailing list archives

From Jochen Hebbrecht <jochenhebbre...@gmail.com>
Subject Re: Finding the most matching (cf. similar) document to another one
Date Fri, 07 Sep 2012 08:12:11 GMT
Hi qibaoyuan,

I tried your second solution, using the scoring data. I think that this
way I can still use MoreLikeThis: every document with a score > X is a
possible match :-).
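
To illustrate the "score > X" idea without pulling in Lucene, here is a minimal, self-contained Java sketch. It is a stand-in under stated assumptions, not MoreLikeThis and not Lucene's scoring: the class `OverlapScore`, its methods, and the threshold 0.5 are hypothetical, and the "score" is simply the fraction of Doc 4's distinct terms that a candidate document contains. With real Lucene scores, X would presumably have to be chosen by experiment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: OR-style matching with a score threshold.
// This is NOT Lucene's similarity; it only counts shared terms.
class OverlapScore {

    // Lowercase and split on whitespace; a crude stand-in for an analyzer.
    static Set<String> terms(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new LinkedHashSet<>();
        }
        return new LinkedHashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Fraction of the query's distinct terms that also occur in the candidate.
    static double coverage(String query, String candidate) {
        Set<String> q = terms(query);
        if (q.isEmpty()) {
            return 0.0;
        }
        Set<String> c = terms(candidate);
        int hits = 0;
        for (String t : q) {
            if (c.contains(t)) {
                hits++;
            }
        }
        return (double) hits / q.size();
    }

    // Indices of the docs whose coverage is strictly above the threshold,
    // ordered from best to worst match.
    static List<Integer> matchesAbove(String query, List<String> docs, double threshold) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (coverage(query, docs.get(i)) > threshold) {
                out.add(i);
            }
        }
        out.sort((a, b) -> Double.compare(
                coverage(query, docs.get(b)), coverage(query, docs.get(a))));
        return out;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "restaurant 't Robbeke fish passoa beer 15 EUR 5 EUR 2 EUR total 22 EUR",
                "restaurant De Genieter scampi's fish sticks cola fanta 18 EUR 15 EUR 2 EUR 2 EUR total 37 EUR",
                "restaurant 't Stoveke frites meat beer 10 EUR 5 EUR total 15 EUR");
        String doc4 = "restaurant De Genieter VAT 37 EUR";

        // Print every candidate that clears the (hypothetical) threshold X = 0.5.
        for (int i : matchesAbove(doc4, docs, 0.5)) {
            System.out.println("Doc " + (i + 1) + " coverage=" + coverage(doc4, docs.get(i)));
        }
    }
}
```

On the three example documents, only Doc 2 clears a threshold of 0.5: it contains 5 of Doc 4's 6 distinct terms, while Doc 1 and Doc 3 each contain 2.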

Thanks!
Jochen



2012/9/7 qibaoyuan <qibaoyuan@126.com>

>
> Maybe you could alter MLT to make it work with the AND operator. But I
> don't think there is anything wrong with using the OR operator. Lucene
> will rank all the docs according to the underlying similarity algorithm
> (VSM, BM25, etc.). In your case, Doc 2 will be ranked first because it
> matches the most words in Doc 4. Other docs containing only some of the
> words in Doc 4 may be listed too, but will get a lower score.
>
>
> At 2012-09-07 15:32:23,"Jochen Hebbrecht" <jochenhebbrecht@gmail.com>
> wrote:
> >Hi,
> >
> >Imagine you are indexing the following documents (every line is stored in
> >a single field, analyzed with the default StandardAnalyzer):
> >- Doc 1: restaurant 't Robbeke fish passoa beer 15 EUR 5 EUR 2 EUR total 22 EUR
> >- Doc 2: restaurant De Genieter scampi's fish sticks cola fanta 18 EUR 15 EUR 2 EUR 2 EUR total 37 EUR
> >- Doc 3: restaurant 't Stoveke frites meat beer 10 EUR 5 EUR total 15 EUR
> >
> >Now, I have another document with the following field:
> >- Doc 4: restaurant De Genieter VAT 37 EUR
> >
> >I'm wondering if Lucene has a feature to find the "most-matching"
> >document. In my example, the "most-matching" document for Doc 4 is
> >Doc 2.
> >I've played around with "MoreLikeThis", but it seems to create a query
> >that joins all terms with an OR operator, something like this:
> >"restaurant OR de OR genieter OR VAT OR 37 OR EUR".
> >What I want is closer to matching on "restaurant" AND "de" AND "genieter"
> >AND "37" AND "EUR". Well, it shouldn't really AND all the terms, because
> >I'm looking for the best match, and some terms may have to be dropped
> >from the list to get to the best match.
> >
> >Maybe it can generate a kind of percentage/scoring to tell me which
> >document is the closest to Doc 4? Does Lucene have this kind of feature?
> >
> >Thanks in advance for any answer,
> >Jochen
>
