lucene-java-user mailing list archives

From qibaoyuan <>
Subject Re: Finding the most matching (cf. similar) document to another one
Date Fri, 07 Sep 2012 07:43:21 GMT

Maybe you could alter MLT (MoreLikeThis) to make it work with the AND operator. But I don't
think there is anything wrong with using the OR operator. Lucene will rank all the docs
according to the underlying similarity algorithm (vector space model, BM25, etc.). In your
case, Doc 2 will be ranked first because it matches the most words in Doc 4. Other docs
containing only some of the words in Doc 4 may be listed too, but will get a lower score.
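
To illustrate why an OR query still puts Doc 2 on top, here is a minimal sketch in plain Java.
It does not use Lucene and it is not Lucene's scoring formula: it only counts which fraction of
the distinct query terms occurs in each document, which is the intuition behind the ranking.
The class name OrMatchSketch, its methods, and the whitespace tokenizer are assumptions made
for this example; a real analyzer such as StandardAnalyzer tokenizes differently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: a toy model of OR-style matching.
class OrMatchSketch {

    // Crude stand-in for an analyzer: lowercase and split on whitespace.
    static Set<String> terms(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        return new LinkedHashSet<>(Arrays.asList(tokens));
    }

    // Fraction (0.0 to 1.0) of distinct query terms that occur in the document.
    static double overlap(String query, String doc) {
        Set<String> queryTerms = terms(query);
        Set<String> docTerms = terms(doc);
        int hits = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                hits++;
            }
        }
        return (double) hits / queryTerms.size();
    }

    // Name of the document with the highest overlap; the first one wins ties.
    static String bestMatch(String query, Map<String, String> docs) {
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, String> entry : docs.entrySet()) {
            double score = overlap(query, entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    // The three indexed documents from the original question.
    static Map<String, String> sampleDocs() {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("Doc 1", "restaurant 't Robbeke fish passoa beer 15 EUR 5 EUR 2 EUR total 22");
        docs.put("Doc 2", "restaurant De Genieter scampi's fish sticks cola fanta 18 EUR 15 "
                + "EUR 2 EUR 2 EUR total 37 EUR");
        docs.put("Doc 3", "restaurant 't Stoveke frites meat beer 10 EUR 5 EUR total 15 EUR");
        return docs;
    }

    public static void main(String[] args) {
        String doc4 = "restaurant De Genieter VAT 37 EUR";
        Map<String, String> docs = sampleDocs();
        for (Map.Entry<String, String> entry : docs.entrySet()) {
            System.out.println(String.format(Locale.ROOT, "%s: %.0f%% of query terms matched",
                    entry.getKey(), 100.0 * overlap(doc4, entry.getValue())));
        }
        System.out.println("Best match: " + bestMatch(doc4, docs));
    }
}
```

In this toy model Doc 2 matches 5 of the 6 distinct terms of Doc 4 (everything except "VAT"),
while Doc 1 and Doc 3 match only "restaurant" and "EUR". Lucene's real score additionally
weighs term frequency, term rarity, and field length, so the absolute numbers will differ,
but the same effect keeps Doc 2 ahead without requiring AND.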

At 2012-09-07 15:32:23,"Jochen Hebbrecht" <> wrote:
>Imagine you are indexing the following documents (every line is stored in 1
>single field, analyzed with the default StandardAnalyzer):
>- Doc 1: restaurant 't Robbeke fish passoa beer 15 EUR 5 EUR 2 EUR total 22
>- Doc 2: restaurant De Genieter scampi's fish sticks cola fanta 18 EUR 15
>EUR 2 EUR 2 EUR total 37 EUR
>- Doc 3: restaurant 't Stoveke frites meat beer 10 EUR 5 EUR total 15 EUR
>Now, I have another document with the following field:
>- Doc 4: restaurant De Genieter VAT 37 EUR
>I'm wondering if Lucene has a feature to find the "most-matching" document.
>In my example, the "most-matching" document for Doc 4 is
>Doc 2.
>I've played around with "MoreLikeThis", but this seems to be creating a
>query with an OR operator for each term. So it created something like this
>"restaurant OR de OR genieter OR VAT OR 37 OR EUR".
>Lucene has to be matching on "restaurant" AND "de" AND "genieter" AND "37"
>AND "EUR". Well, it shouldn't be really AND'ing all terms, because I'm
>looking for the best match. And it could be that some terms should be
>removed from the list, to get to the best match.
>Maybe it can generate a kind of percentage/scoring to tell me which
>document is the closest to Doc 4? Does Lucene have this kind of feature?
>Thanks in advance for any answer,
