lucene-dev mailing list archives

From Doug Cutting
Subject Re: Returning a minimum number of clusters
Date Mon, 01 May 2006 17:38:11 GMT
Marvin Humphrey wrote:
> The problem I'm trying to solve is how to return a minimum number of  
> clusters from a search.  Say the most relevant 100 documents for a  
> query are all from the same domain, but you want a maximum of two  
> results per domain, a la Google.  I don't see any alternative to  
> rerunning the query an indeterminate number of times until you've  
> accumulated sufficient clusters, because the search logic doesn't know
> what cluster a document belongs to until the document vector is retrieved.
> Is there a better way?

Nutch implements host-deduping roughly as follows:

To fetch the first 10 hits it first asks for the top-scoring 20 or so. 
Then it uses a field cache to reduce this to just two from each host. 
If it runs out of raw hits before it has ten, it re-runs the query, this
time for the top-scoring 40 hits, modified to exclude matches from hosts
that have already returned more than two hits.
(Nutch also automatically converts clauses like "" into 
cached filters when "" occurs in more than a certain percentage 
of documents.) Thus, in the worst case, it could take five queries to 
return the top ten hits, but in practice I've never seen more than 
three, and the re-query rate is usually quite low.  Since raw hits are
cheap to compute and, with a field cache, the host filtering is also
fast, one can reduce the re-query rate simply by starting with a larger
number of raw hits, at little performance cost.
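
The loop described above can be sketched roughly as follows.  This is a
minimal Python illustration under stated assumptions, not Nutch's actual
code: the names make_searcher and dedup_search, the (score, host, doc_id)
hit shape, and the raw_factor parameter are all hypothetical, and the
host carried on each hit stands in for the field-cache lookup.

```python
from collections import Counter
from typing import Callable, List, Set, Tuple

# Hypothetical hit shape for this sketch: (score, host, doc_id).
Hit = Tuple[float, str, str]
Searcher = Callable[[int, Set[str]], List[Hit]]


def make_searcher(index: List[Hit]) -> Searcher:
    """Toy raw search: the n top-scoring hits, skipping excluded hosts."""
    ranked = sorted(index, key=lambda hit: hit[0], reverse=True)

    def search(n: int, excluded_hosts: Set[str]) -> List[Hit]:
        return [h for h in ranked if h[1] not in excluded_hosts][:n]

    return search


def dedup_search(search: Searcher, num_hits: int = 10, max_per_host: int = 2,
                 raw_factor: int = 2) -> Tuple[List[Hit], int]:
    """Return up to num_hits hits with at most max_per_host per host,
    plus the number of raw queries that were needed."""
    if num_hits <= 0:
        return [], 0
    num_raw = num_hits * raw_factor      # e.g. ask for 20 raw hits to fill 10
    kept: List[Hit] = []
    seen: Set[str] = set()               # doc ids already examined
    per_host: Counter = Counter()        # hits kept so far, per host
    excluded: Set[str] = set()           # hosts that overflowed their quota
    queries = 0
    while True:
        raw = search(num_raw, excluded)
        queries += 1
        for hit in raw:
            _, host, doc_id = hit
            if doc_id in seen:
                continue                 # already handled in an earlier pass
            seen.add(doc_id)
            if per_host[host] < max_per_host:
                per_host[host] += 1
                kept.append(hit)
                if len(kept) == num_hits:
                    return kept, queries
            else:
                excluded.add(host)       # prohibit this host on any re-query
        if len(raw) < num_raw:
            return kept, queries         # nothing left to find
        num_raw *= raw_factor            # re-query for 40, then 80, ...


# Toy index: one host dominates the top of the ranking.
index = (
    [(100.0 - i, "a.example", f"a{i}") for i in range(30)]
    + [(50.0 - i, "b.example", f"b{i}") for i in range(3)]
    + [(10.0, "c.example", "c0")]
)
hits, queries = dedup_search(make_searcher(index), num_hits=4)
print([doc_id for _, _, doc_id in hits], queries)
# prints ['a0', 'a1', 'b0', 'b1'] 2
```

In this toy run the first eight raw hits all come from a.example, so only
two survive and that host is excluded from the second, larger query.
Raising raw_factor trades a slightly bigger first query for fewer
re-queries.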

BTW, clustering in Information Retrieval usually implies grouping by 
vector distance using statistical methods:

