lucene-dev mailing list archives

From "John Kleven" <johnkle...@gmail.com>
Subject Request to change "coord" similarity API:
Date Wed, 22 Aug 2007 20:19:07 GMT
I'm hoping that the coord similarity API can be changed from:
float coord(int overlap, int maxOverlap)

TO

float coord(int overlap, int maxOverlap, int docSize)

where docSize is the number of terms in the document/hit being evaluated for
similarity to the query.

The reason for this is that many people are using Lucene to match documents
that are not web pages, and in these cases the query and the document MUST
be similar in size.  For example ...

Say your documents are cars, and there are 3 styles of a Volvo wagon:
 - "Volvo V70 Wagon"   (just the "normal" edition)
 - "Volvo V70 Wagon Luxury Edition"
 - "Volvo V70 Wagon Luxury Edition Sports Package AWD"

If somebody searches for a longer name, like "Volvo V70 Wagon Luxury Edition
Sports Package AWD", then the normal edition "Volvo V70 Wagon" will most
likely be excluded, because only 3 of the 8 query terms match it and the
coord factor is 3/8.
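
To make the 3/8 concrete, here is a small sketch in plain Java (no Lucene
dependency; the class name is made up for illustration). It assumes the
default coord is simply overlap divided by maxOverlap:

```java
final class CoordExample {
    // Fraction of the query's terms that were found in the document.
    // overlap    = number of query terms matched in the document
    // maxOverlap = total number of terms in the query
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        // Query: "Volvo V70 Wagon Luxury Edition Sports Package AWD" (8 terms)
        // Document "Volvo V70 Wagon" matches 3 of them.
        System.out.println(coord(3, 8)); // 0.375
        // The 8-term document matches all 8.
        System.out.println(coord(8, 8)); // 1.0
    }
}
```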

**However**, in the reverse situation, if somebody wants to search for the
normal wagon, "Volvo V70 Wagon", it will match all 3 of these with the same
score.  Nothing can help here.  Changing lengthNorm to intentionally lower
the score of car names as they get longer doesn't make sense: the "Volvo V70
Wagon Luxury Edition Sports Package AWD" is just as much of a car as the
"Volvo V70 Wagon".  So the lengthNorm is using the "SweetSpot" or "Plateau"
methodology, and anything between 2 words and about 10 is a legit value.
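
For readers unfamiliar with the plateau idea, a rough sketch follows. It is
plain Java, not the actual Lucene class; the formula, the class name, and the
steepness value are assumptions chosen for illustration, and only the 2 to 10
range comes from the description above. Every length inside the plateau gets
the same norm, which is why lengthNorm cannot separate the 3-term name from
the 8-term one:

```java
final class PlateauLengthNorm {
    // Plateau bounds from the message: 2 words up to about 10.
    static final int MIN_TERMS = 2;
    static final int MAX_TERMS = 10;
    // Assumed decay rate outside the plateau (illustrative only).
    static final float STEEPNESS = 0.5f;

    static float lengthNorm(int numTerms) {
        // Zero inside [MIN_TERMS, MAX_TERMS]; outside, twice the distance
        // to the nearest plateau edge.
        int distance = Math.abs(numTerms - MIN_TERMS)
                + Math.abs(numTerms - MAX_TERMS)
                - (MAX_TERMS - MIN_TERMS);
        return (float) (1.0 / Math.sqrt(STEEPNESS * distance + 1.0));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(3));  // 1.0 ("Volvo V70 Wagon")
        System.out.println(lengthNorm(8));  // 1.0 (the 8-term name)
        System.out.println(lengthNorm(20)); // below 1.0, outside the plateau
    }
}
```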

So, back to my original request.  By changing coord to also take the length
of the matching document, it would allow coord to lower scores on docs that
are not similar in length to the original query.  Again, searching "Volvo
V70 Wagon": when the hit for "Volvo V70 Wagon Luxury Edition Sports Package
AWD" is analyzed, coord would be told that it has 8 terms, vs. the 3 that
I'm looking for, and then I could apply any algorithm I want to reduce the
hit score (in this case, most likely returning 3/8).  However, if your
application does consider those hits all the same, then you could leave the
current implementation as is, and return a 1.
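
A sketch of what an implementation of the proposed method might look like.
This is plain Java and purely hypothetical: the three-argument signature is
the one requested above, not an existing Lucene API, and dividing by the
larger of the two sizes is just one possible policy that happens to give 3/8
in both directions:

```java
final class SizeAwareCoord {
    // Hypothetical three-argument coord.
    // overlap    = number of query terms matched in the document
    // maxOverlap = total number of terms in the query
    // docSize    = number of terms in the document being scored
    static float coord(int overlap, int maxOverlap, int docSize) {
        // Penalize a mismatch in either direction by dividing by the
        // larger of the query size and the document size.
        int denominator = Math.max(Math.max(maxOverlap, docSize), 1);
        return overlap / (float) denominator;
    }

    public static void main(String[] args) {
        // Query "Volvo V70 Wagon" (3 terms) against the 8-term document.
        System.out.println(coord(3, 3, 8)); // 0.375
        // Same query against the 3-term document: exact size match.
        System.out.println(coord(3, 3, 3)); // 1.0
        // 8-term query against the 3-term document.
        System.out.println(coord(3, 8, 3)); // 0.375
    }
}
```

An application that treats all three hits as equal would ignore docSize and
return overlap / maxOverlap, or simply 1.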

Hopefully this makes sense.  I'm (sort of) aware that I could code this up
myself with a custom query and scorer class, but I think it warrants being
added to the abstract Similarity class.  I'm not a pro on Lucene, so I could
be missing something.  Thank you for reading.

Sincerely,
John
