lucene-java-user mailing list archives

From Jochen Hebbrecht <jochenhebbre...@gmail.com>
Subject Finding the most matching (cf. similar) document to another one
Date Fri, 07 Sep 2012 07:32:23 GMT
Hi,

Imagine you are indexing the following documents (each document is stored
in a single field, analyzed with the default StandardAnalyzer):
- Doc 1: restaurant 't Robbeke fish passoa beer 15 EUR 5 EUR 2 EUR total 22
EUR
- Doc 2: restaurant De Genieter scampi's fish sticks cola fanta 18 EUR 15
EUR 2 EUR 2 EUR total 37 EUR
- Doc 3: restaurant 't Stoveke frites meat beer 10 EUR 5 EUR total 15 EUR

Now I have another document with the following field content:
- Doc 4: restaurant De Genieter VAT 37 EUR

I'm wondering whether Lucene has a feature to find the "most-matching"
document. In my example, the "most-matching" document for Doc 4 is Doc 2.
I've played around with "MoreLikeThis", but it seems to build a query that
joins the terms with an OR operator, something like
"restaurant OR de OR genieter OR VAT OR 37 OR EUR".
What I want is closer to matching on "restaurant" AND "de" AND "genieter"
AND "37" AND "EUR". It shouldn't strictly AND all the terms, though,
because I'm looking for the best match, and some terms (such as "VAT" here)
may have to be dropped from the list to get to that best match.
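
The "AND, but allowed to drop a term" idea above can be sketched as a
relaxed AND: a document matches when it contains at least N of the distinct
query terms. The plain-Java sketch below is only an illustration and does
not use Lucene; the class name RelaxedAndMatcher, the method names, and the
regex tokenizer (a rough stand-in for StandardAnalyzer) are all assumptions.
In Lucene itself, a BooleanQuery made of SHOULD clauses combined with
setMinimumNumberShouldMatch is probably the closest equivalent.

```java
import java.util.*;

final class RelaxedAndMatcher {
    // Lowercase and split on non-alphanumeric characters. This only
    // approximates StandardAnalyzer and is an assumption of the sketch.
    static Set<String> terms(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Return the indexes of the documents that contain at least minMatch
    // distinct query terms.
    static List<Integer> matching(List<String> docs, String query, int minMatch) {
        Set<String> q = terms(query);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            Set<String> shared = terms(docs.get(i));
            shared.retainAll(q); // keep only the terms also present in the query
            if (shared.size() >= minMatch) hits.add(i);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "restaurant 't Robbeke fish passoa beer 15 EUR 5 EUR 2 EUR total 22 EUR",
            "restaurant De Genieter scampi's fish sticks cola fanta 18 EUR 15 EUR 2 EUR 2 EUR total 37 EUR",
            "restaurant 't Stoveke frites meat beer 10 EUR 5 EUR total 15 EUR");
        String doc4 = "restaurant De Genieter VAT 37 EUR";
        // Strict AND over all 6 terms matches nothing, because no document has "vat".
        System.out.println(matching(docs, doc4, 6)); // prints []
        // Allowing one term to be dropped leaves only Doc 2 (index 1).
        System.out.println(matching(docs, doc4, 5)); // prints [1]
    }
}
```

With the threshold at 6 the query behaves like a strict AND and finds
nothing; lowering it to 5 lets "VAT" drop out and leaves Doc 2 as the only
hit.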

Can Lucene produce some kind of percentage or score that tells me which
document is closest to Doc 4? Does Lucene have a feature like this?
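
One simple way to picture such a percentage is the share of Doc 4's
distinct terms that each indexed document contains. The plain-Java sketch
below is a toy illustration under that assumption, not Lucene's scoring
formula; the class name MatchPercentage, its methods, and the regex
tokenizer (a rough stand-in for StandardAnalyzer) are hypothetical.

```java
import java.util.*;

final class MatchPercentage {
    // Lowercase and split on non-alphanumeric characters. This only
    // approximates StandardAnalyzer and is an assumption of the sketch.
    static Set<String> terms(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // For each document, the percentage (0..100) of distinct query terms
    // that also occur in that document.
    static double[] percentages(List<String> docs, String query) {
        Set<String> q = terms(query);
        double[] result = new double[docs.size()];
        if (q.isEmpty()) return result; // nothing to match against
        for (int i = 0; i < docs.size(); i++) {
            Set<String> shared = terms(docs.get(i));
            shared.retainAll(q);
            result[i] = 100.0 * shared.size() / q.size();
        }
        return result;
    }

    // Index of the highest score; ties go to the earliest document.
    // Returns -1 for an empty array.
    static int best(double[] scores) {
        int bestIndex = -1;
        for (int i = 0; i < scores.length; i++) {
            if (bestIndex < 0 || scores[i] > scores[bestIndex]) bestIndex = i;
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "restaurant 't Robbeke fish passoa beer 15 EUR 5 EUR 2 EUR total 22 EUR",
            "restaurant De Genieter scampi's fish sticks cola fanta 18 EUR 15 EUR 2 EUR 2 EUR total 37 EUR",
            "restaurant 't Stoveke frites meat beer 10 EUR 5 EUR total 15 EUR");
        double[] p = percentages(docs, "restaurant De Genieter VAT 37 EUR");
        for (int i = 0; i < p.length; i++) {
            System.out.println(String.format(Locale.ROOT, "Doc %d: %.0f%%", i + 1, p[i]));
        }
        System.out.println("Closest: Doc " + (best(p) + 1));
    }
}
```

On the three example documents this gives Doc 2 five of the six query terms
(about 83%), while Doc 1 and Doc 3 each share only "restaurant" and "EUR"
(about 33%). A real ranking would normally also weight rare terms such as
"genieter" more heavily than common ones such as "EUR".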

Thanks in advance for any answer,
Jochen
