lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Toke Eskildsen>
Subject Query time document group boosting
Date Wed, 26 Nov 2008 15:54:23 GMT
We use Lucene at our library for indexing from different sources into
the same logical index. The sources are very diverse and are prioritized
differently at index-time with document boosts. However, different
groups of users (or individual users for that matter) have different
preferences for the relevancy of the sources, which clashes with
index-time boosting. Query-time tweaking would be preferable.

My coworker Mikkel Kamstrup Erlandsen had this bright and slightly scary

Suppose we have sources A-Z. For each document from the sources, we add
the term groupboost_<source>:dummy. All documents from source A has
groupboost_A:dummy, all documents from source B has groupboost_B:dummy
and so on.

Now, whenever the user enters a query, we parse it the normal way and
wrap it in a BooleanQuery where we add our groupterm_<source>:dummy as
TermQueries with boosts specified by the user (or more realistically
under the hood by the front-end for the user).

Example: Let's say we have a user that love all things from source A and
hates the ones from source C. The front-end knows this. The user enters
the query "foo" which expands to 
"foo OR groupboost_A:dummy^10 OR groupboost_C:dummy^0.1"
The result should be that there's a high probability that the first hits
will come from source A, unless there are significantly better matches
from other groups. Likewise hits from group C will probably be near the
end of the list of hits.

Presto! The user gets what he wants, practically no search-time penalty,
simple. One obvious limitation is that we don't want too many groups aka
sources for this, but in reality we're talking 10-30 groups, so I don't
see that as a problem.

So what's the scary part? I don't know, I just have a feeling that Here
Be Dragons. It seems that it should work without messing with ranking,
besides the specific boost of course, as all documents match exactly one
"groupboost_<source>:dummy"-query, but I would like to hear the opinion
of more seasoned Lucene users. Is it a sensible way to approach the

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message