lucene-solr-user mailing list archives

From "Charles Hornberger" <charles.hornber...@gmail.com>
Subject eliminating "too many results from the same source"
Date Sun, 06 Jan 2008 23:25:35 GMT
I've got a problem that I'm not quite sure how to solve and am wondering if
anyone has any insight or similar experience to share.

Here's the situation: Documents in our Solr index include a field
identifying their author (we have thousands of authors). When displaying an
individual document, we also want to display a list of related documents by
other authors*, so we run a search using the current document's title, author
name, summary, and keywords as the query. Sometimes the search yields a
result set in which all of the top n documents (in practice, n is ~10) are
from one author.

Apparently, people don't like this.

So what is being asked for is a result set in which no more than m (where m
is probably 3) of the top n are from any single author. (It's not that we
want to exclude documents m+1, m+2, etc. by each author from the result set
entirely; we just don't want them in the top n.)
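
To make the requirement concrete, here is a minimal sketch of the re-ranking
done client-side, after the search returns. It is not a Solr feature. It
assumes the results arrive as a list of dicts already sorted by score, each
with an "author" key, and that more than n rows were fetched so there is
something to promote. The function name cap_per_author and its parameters are
hypothetical.

```python
from collections import defaultdict

def cap_per_author(results, n=10, m=3, key="author"):
    """Reorder score-sorted results so that at most m of the first n
    share the same value for `key`. Nothing is dropped: documents that
    exceed the cap are moved below the top n, in their original order."""
    counts = defaultdict(int)   # how many docs per author are in the top so far
    top, deferred = [], []
    for doc in results:
        value = doc.get(key)
        if len(top) < n and counts[value] < m:
            counts[value] += 1
            top.append(doc)
        else:
            # Either the top n is already full or this author hit the cap.
            deferred.append(doc)
    return top + deferred

# Hypothetical data: five docs by author "a" outrank one each by "b" and "c".
docs = [{"id": i, "author": "a"} for i in range(5)]
docs += [{"id": 5, "author": "b"}, {"id": 6, "author": "c"}]
reranked = cap_per_author(docs, n=4, m=2)
```

If the fetched rows do not contain enough documents by other authors to fill
the top n, the capped documents flow back into those slots, so the cap is only
as good as the number of rows requested.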

More generally, I can imagine this as a feature that might occasionally be
useful, e.g. as a kind of "diversity boost function" to be used when scoring
results, where you specify the fields for which you want to enforce
diversity (e.g., author name, genre, color, etc.), provide your values
for n and m, and Solr, uhm, obliges. :-)

Any tips or ideas on how to proceed? (We're using Solr 1.2 so we don't have
MoreLikeThis, but we can upgrade to a newer version if it's likely that
MoreLikeThis can provide what we're looking for.)

-Charlie

* In fact, we wouldn't mind if additional documents by the same author were
included, but we found that when we didn't exclude the original author from
the result set, we almost always had the same problem: The first n documents
were always by the original author.
