lucene-dev mailing list archives

From: Doug Cutting
Subject: Re: Returning a minimum number of clusters
Date: Mon, 01 May 2006 22:51:22 GMT

Marvin Humphrey wrote:
> On May 1, 2006, at 10:38 AM, Doug Cutting wrote:
>> Nutch implements host-deduping roughly as follows:
>> To fetch the first 10 hits it first asks for the top-scoring 20 or
>> so. Then it uses a field cache to reduce this to just two from each
>> host. If it runs out of raw hits, then it re-runs the query, this
>> time for the top-scoring 40 hits. But the query is modified this
>> time to exclude matches from hosts that have already returned more
>> than two hits. (Nutch also automatically converts clauses like "-"
>> into cached filters when "" occurs in more than a certain
>> percentage of documents.)
> Is that an optimization which only works for Nutch and hosts, or is it
> something that could be generalized and implemented sanely in Lucene?

It's probably generalizable.
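
As a rough illustration of the loop described in the quoted text, here is a minimal Java sketch with the dedup key left generic enough to be any field, not just the host. This is not the Nutch code: the names HostDedup, Hit, Searcher and topDeduped are hypothetical, Searcher stands in for whatever runs the query with certain hosts excluded, and doubling the raw hit count on each re-run is an assumption taken from the 20-then-40 example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical names throughout; a sketch of the idea, not the Nutch implementation.
final class HostDedup {

    /** A raw hit: document id, the value of its dedup field (here the host), and its score. */
    static final class Hit {
        final int doc;
        final String host;
        final float score;

        Hit(int doc, String host, float score) {
            this.doc = doc;
            this.host = host;
            this.score = score;
        }
    }

    /** Stand-in for the search backend: up to n top-scoring hits, skipping the excluded hosts. */
    interface Searcher {
        List<Hit> search(int n, Set<String> excludedHosts);
    }

    /** Returns up to 'wanted' hits in score order, keeping at most 'maxPerHost' per host. */
    static List<Hit> topDeduped(Searcher searcher, int wanted, int maxPerHost) {
        List<Hit> kept = new ArrayList<>();
        if (wanted <= 0) {
            return kept;
        }
        Set<Integer> seenDocs = new HashSet<>();       // docs already examined in an earlier pass
        Map<String, Integer> perHost = new HashMap<>(); // hits seen so far per host
        Set<String> excluded = new HashSet<>();        // hosts that exceeded maxPerHost
        int rawWanted = wanted * 2;                    // e.g. ask for 20 raw hits to fill 10

        while (true) {
            List<Hit> raw = searcher.search(rawWanted, excluded);
            for (Hit hit : raw) {
                if (!seenDocs.add(hit.doc)) {
                    continue; // already handled before the re-run
                }
                int count = perHost.merge(hit.host, 1, Integer::sum);
                if (count > maxPerHost) {
                    excluded.add(hit.host); // prohibit this host in any later re-run
                    continue;
                }
                kept.add(hit);
                if (kept.size() == wanted) {
                    return kept;
                }
            }
            if (raw.size() < rawWanted) {
                return kept; // the index has no more matches to offer
            }
            rawWanted *= 2; // ran out of raw hits: re-run for twice as many
        }
    }
}
```

Excluding the overflowing hosts in the re-run is what keeps the second pass from being filled with the same dominant host again; the hits already kept from those hosts stay in the result.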

The stuff that optimizes queries into filters is in:

The deduping logic is in:

