lucene-java-user mailing list archives

From "Walt Stoneburner" <>
Subject Re: Scoring on Number of Unique Terms Hit, Not Term Frequency Counts
Date Fri, 25 May 2007 13:49:53 GMT
Grant writes:
> Have a look at the DisjunctionMaxQuery, I think it might help,
> although I am not sure it will fully cover your case.

The definition for DisjunctionMaxQuery is provided in its Javadoc.

Trimming the synopsis text heavily, we end up with this simplified
description:
'This is useful when searching for a word in multiple fields ... if
the query is "albino elephant" this ensures that "albino" matching one
field and "elephant" matching another gets a higher score than
"albino" matching both fields'

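To check my own reading of that synopsis, here is a minimal sketch in
plain Java. It makes no Lucene calls; the class name, the method names,
and the per-field scores are all made up for illustration, and the
"best field plus tie-breaker times the rest" rule is my understanding
of the Javadoc rather than a copy of the real scorer.

```java
// Illustrative only: not Lucene code, just the two combination rules.
class DisMaxSketch {

    // Plain sum: every matching field contributes its full score.
    static double sumScore(double[] fieldScores) {
        double total = 0.0;
        for (double s : fieldScores) {
            total += s;
        }
        return total;
    }

    // Disjunction-max: take the best field score, then add the other
    // field scores scaled by a tie-breaker (0.0 means "best field only").
    static double disMaxScore(double[] fieldScores, double tieBreaker) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : fieldScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }
}
```

Suppose each field match is worth 1.0. A document with "albino" in both
fields and no "elephant" gets disMaxScore({1.0, 1.0}, 0.0) = 1.0 for
"albino" and 0.0 for "elephant", so 1.0 in total. A document with
"albino" in one field and "elephant" in the other gets 1.0 + 1.0 = 2.0.
With sumScore, both documents would tie at 2.0.
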
First off, thanks Grant -- I hadn't even considered the possibility of
what happens if multiple fields in the _same_ document matched.
That's an intriguing case, indeed.

However, for my particular dataset, I only have the one field
containing the contents of the document, so unless I've missed an
alternate way of using it, I'm not sure how I should apply it to my
specific case.

For clarification, what I'm trying to do is make sure that a document
which uses a single term many times doesn't drown out a document that
uses more of the search terms, though less frequently, when the scores
are returned.

Take a document that says: "Albino. Albino. Albino. Albino. Albino.
Albino. Albino!"  Right there, that's seven hits on albino, so this
must _really_ be a document about albino.

Take a document that says "Albino elephant." and nothing more.  This
only has two keyword hits.

What I want to do is make sure the returned results don't go "Oh, 7 is
more than 2, let's return the Albino document first."

Instead, I'm looking for "This document matched 2 of the things he was
looking for, albino and also elephant, while the other document only
matched 1 of the things he was looking for -- 2 is more than 1, so
give the 'Albino elephant.' the best score."
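
To pin down the ordering I'm after, here is a rough sketch in plain
Java. It has nothing to do with how Lucene computes scores internally;
the class name, the method names, and the crude tokenizer are
hypothetical and exist only to show the two ways of counting.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: counts hits two ways for a single-field document.
class UniqueTermSketch {

    // Crude tokenizer: lowercase, then split on anything that is not a-z.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Term-frequency style count: every occurrence of a query term counts.
    static int totalHits(String doc, Set<String> queryTerms) {
        int hits = 0;
        for (String token : tokenize(doc)) {
            if (queryTerms.contains(token)) {
                hits++;
            }
        }
        return hits;
    }

    // What I want to rank by: how many distinct query terms appear at all.
    static int uniqueHits(String doc, Set<String> queryTerms) {
        Set<String> matched = new HashSet<>();
        for (String token : tokenize(doc)) {
            if (queryTerms.contains(token)) {
                matched.add(token);
            }
        }
        return matched.size();
    }
}
```

For the query terms "albino" and "elephant", the seven-Albino document
has totalHits 7 but uniqueHits 1, while "Albino elephant." has
totalHits 2 and uniqueHits 2. Sorting on uniqueHits first puts
"Albino elephant." on top, which is the behavior I'm asking about.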


ps.  I wasn't even aware DisjunctionMaxQuery existed; is there a
resource that describes the purpose of BooleanQuery,
DisjunctionMaxQuery, and others in a simple reference?

For instance, if I go to the BooleanQuery page,
it doesn't even say "sum of the field scores" -- maybe I'm looking in
the wrong place, but for someone new to the API, it's very hard to
figure out what class you want when it's unclear what specific effect
it has on scoring.
