lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Returning a minimum number of clusters
Date Mon, 01 May 2006 22:51:22 GMT
Marvin Humphrey wrote:
> On May 1, 2006, at 10:38 AM, Doug Cutting wrote:
>> Nutch implements host-deduping roughly as follows:
>>
>> To fetch the first 10 hits it first asks for the top-scoring 20 or
>> so. Then it uses a field cache to reduce this to just two from each
>> host. If it runs out of raw hits, then it re-runs the query, this
>> time for the top-scoring 40 hits. But the query is modified this
>> time to exclude matches from hosts that have already returned more
>> than two hits. (Nutch also automatically converts clauses like
>> "-host:foo.com" into cached filters when "foo.com" occurs in more
>> than a certain percentage of documents.)
> 
> Is that an optimization which only works for Nutch and hosts, or is it
> something that could be generalized and implemented sanely in Lucene?

It's probably generalizable.

The stuff that optimizes queries into filters is in:

http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java?view=markup
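To make the idea concrete without reading that file, here is a rough sketch of the general technique only: an exclusion such as "-host:foo.com" is turned into a reusable bit set of allowed documents, and that bit set is cached only when the host occurs in more than some fraction of the documents. This is not the LuceneQueryOptimizer code and uses no Lucene classes. The class name CachedExclusionSketch, the threshold parameter, and the in-memory hostOfDoc list are all assumptions made for illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: cache "exclude this host" filters for frequent hosts only. */
final class CachedExclusionSketch {
    private final List<String> hostOfDoc;  // hostOfDoc.get(docId) is that document's host
    private final double threshold;        // fraction of documents above which we cache
    private final Map<String, BitSet> filterCache = new HashMap<>();

    CachedExclusionSketch(List<String> hostOfDoc, double threshold) {
        this.hostOfDoc = hostOfDoc;
        this.threshold = threshold;
    }

    /** True when the host occurs in more than `threshold` of all documents. */
    boolean isFrequent(String host) {
        long docFreq = hostOfDoc.stream().filter(host::equals).count();
        return docFreq > threshold * hostOfDoc.size();
    }

    /**
     * Documents permitted by the clause "-host:<host>".
     * Frequent hosts get a cached bit set (callers must not mutate it);
     * rare hosts are recomputed on every call.
     */
    BitSet allowedDocs(String host) {
        if (isFrequent(host)) {
            return filterCache.computeIfAbsent(host, this::computeAllowed);
        }
        return computeAllowed(host);
    }

    /** Number of filters currently held in the cache. */
    int cachedFilterCount() {
        return filterCache.size();
    }

    private BitSet computeAllowed(String host) {
        BitSet allowed = new BitSet(hostOfDoc.size());
        for (int doc = 0; doc < hostOfDoc.size(); doc++) {
            if (!host.equals(hostOfDoc.get(doc))) {
                allowed.set(doc);
            }
        }
        return allowed;
    }
}
```

The reason to cache only frequent terms is that they are the expensive ones to evaluate as query clauses and the ones most likely to be excluded again, so the memory spent on a bit set is more likely to pay off.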

The deduping logic is in:

http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/NutchBean.java?view=markup
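For anyone who wants the shape of the loop without reading NutchBean, the description quoted above can be sketched roughly as follows. This is a simplified illustration, not the NutchBean code: Hit, Searcher, and dedupedTopHits are hypothetical names, the search backend is abstracted behind an interface rather than a real Lucene query, and the starting raw count of twice the requested hits is an assumption taken from the "10 hits, ask for 20" example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of host deduping with query re-runs. */
final class HostDedupSketch {

    /** A hit reduced to the fields the dedup step needs. */
    static final class Hit {
        final String url;
        final String host;
        final double score;

        Hit(String url, String host, double score) {
            this.url = url;
            this.host = host;
            this.score = score;
        }
    }

    /** Stand-in for the backend: top n hits by score, skipping excluded hosts. */
    interface Searcher {
        List<Hit> search(int n, Set<String> excludedHosts);
    }

    static List<Hit> dedupedTopHits(Searcher searcher, int wanted, int maxPerHost) {
        List<Hit> kept = new ArrayList<>();
        if (wanted <= 0) {
            return kept;
        }
        Set<String> seenUrls = new HashSet<>();        // hits handled in earlier rounds
        Map<String, Integer> keptPerHost = new HashMap<>();
        Set<String> excludedHosts = new HashSet<>();   // hosts that hit the per-host cap
        int rawCount = wanted * 2;                     // e.g. ask for 20 to fill 10

        while (true) {
            List<Hit> raw = searcher.search(rawCount, excludedHosts);
            for (Hit hit : raw) {
                if (!seenUrls.add(hit.url)) {
                    continue;                          // already processed last round
                }
                int count = keptPerHost.getOrDefault(hit.host, 0);
                if (count >= maxPerHost) {
                    excludedHosts.add(hit.host);       // prohibit this host on re-run
                    continue;
                }
                keptPerHost.put(hit.host, count + 1);
                kept.add(hit);
                if (kept.size() == wanted) {
                    return kept;
                }
            }
            if (raw.size() < rawCount) {
                return kept;                           // backend has no more hits
            }
            rawCount *= 2;                             // ran out of raw hits: 20 -> 40
        }
    }
}
```

Excluding the over-represented hosts on the re-run is what keeps the second, larger request from being dominated by the same hosts that used up the first one.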

Doug
