lucene-dev mailing list archives

From: Marvin Humphrey <mar...@rectangular.com>
Subject: Re: Returning a minimum number of clusters
Date: Mon, 01 May 2006 19:03:57 GMT

On May 1, 2006, at 10:38 AM, Doug Cutting wrote:

> Nutch implements host-deduping roughly as follows:
>
> To fetch the first 10 hits it first asks for the top-scoring 20 or  
> so. Then it uses a field cache to reduce this to just two from each  
> host. If it runs out of raw hits, then it re-runs the query, this  
> time for the top scoring 40 hits.  But the query is modified this  
> time to exclude matches from hosts that have already returned more  
> than two hits. (Nutch also automatically converts clauses like
> "-host:foo.com" into cached filters when "foo.com" occurs in more
> than a certain percentage of documents.)

Is that an optimization that only works for Nutch and hosts, or is
it something that could be generalized and implemented sanely in Lucene?
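
To check that I follow the shape of it, here's a rough,
self-contained sketch of the dedupe-and-requery loop in plain Java.
The Hit class and fetchRawHits() are stand-ins of my own invention,
not actual Nutch or Lucene API, and the "-host:" exclusion is left
as a comment rather than implemented:

    import java.util.*;

    /**
     * Rough sketch of the dedupe-and-requery loop described above.
     * Hit and fetchRawHits() are stand-ins, not Nutch or Lucene API.
     */
    public class HostDedupe {

        /** Minimal raw hit: a document id, its host, and its score. */
        static class Hit {
            final int doc; final String host; final float score;
            Hit(int doc, String host, float score) {
                this.doc = doc; this.host = host; this.score = score;
            }
        }

        /** Stand-in for the real search: top numRaw hits, best first.
         *  This is where the query would pick up "-host:" clauses (or
         *  cached filters) for hosts that already filled their quota. */
        static List<Hit> fetchRawHits(int numRaw) {
            return new ArrayList<Hit>(); // ... run the query here ...
        }

        /** Top n hits, at most maxPerHost from any one host.  Doubles
         *  the raw fetch size and retries whenever the raw hits run
         *  out before n survivors are found. */
        static List<Hit> topDeduped(int n, int maxPerHost) {
            int numRaw = 2 * n;                   // start by over-fetching
            while (true) {
                List<Hit> raw = fetchRawHits(numRaw);
                List<Hit> kept = new ArrayList<Hit>();
                Map<String, Integer> perHost = new HashMap<String, Integer>();
                for (Hit hit : raw) {
                    Integer seen = perHost.get(hit.host);
                    int count = (seen == null) ? 0 : seen.intValue();
                    if (count < maxPerHost) {     // field cache lookup, in Nutch
                        kept.add(hit);
                        perHost.put(hit.host, count + 1);
                        if (kept.size() == n) return kept;
                    }
                }
                if (raw.size() < numRaw) return kept; // index exhausted
                numRaw *= 2;                      // ran dry: re-query deeper
            }
        }
    }

With n = 10, the 2 * n starting point and the doubling reproduce the
20-then-40 progression you describe.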

> Thus, in the worst case, it could take five queries to return the  
> top ten hits, but in practice I've never seen more than three, and  
> the re-query rate is usually quite low.  Since raw hits are cheap  
> to compute, and, with a field cache, the host filtering is also  
> fast, to reduce the re-query rate one can simply start by  
> searching for a larger number of raw hits, with little performance  
> impact.

Great, thanks.  It's good to know that in practice re-running the
queries is not much of a concern.

> BTW, clustering in Information Retrieval usually implies grouping  
> by vector distance using statistical methods:
>
> http://en.wikipedia.org/wiki/Data_clustering

Exactly.  I've scanned this, but I haven't yet familiarized myself
with the different models.

It may be possible for both keyword fields (e.g. "host") and
non-keyword fields (e.g. "content") to be clustered using the same
algorithm and an interface like
Hits.cluster(String fieldname, int docsPerCluster).  Retrieve each
hit's term vector for the specified field, map the docs into a
unified term space, and then cluster.  For "host" or any other
keyword field, the cluster boundaries will be stark and the cost of
calculation negligible.  For "content", a more sophisticated model
will be required to group the docs, and the cost will be greater.
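
As a strawman of the unified-term-space step, here's a tiny sketch
(plain Java; every name is hypothetical, and the greedy pass at the
end is a placeholder rather than a real clustering model).  For a
keyword field each doc's vector is a single term, so cosine
similarity is 1.0 or 0.0 and the pass degenerates to exact
bucketing; for "content" a proper model would replace it:

    import java.util.*;

    /** Strawman for Hits.cluster(): map per-doc term freqs into one
     *  term space, then group by similarity.  All names hypothetical. */
    public class FieldClusterer {

        /** Assign each distinct term a dimension and turn each doc's
         *  (term -> freq) map into a sparse vector in that space. */
        static List<Map<Integer, Float>> unify(
                List<Map<String, Integer>> docTerms,
                Map<String, Integer> termSpace) {
            List<Map<Integer, Float>> vectors =
                new ArrayList<Map<Integer, Float>>();
            for (Map<String, Integer> terms : docTerms) {
                Map<Integer, Float> vec = new HashMap<Integer, Float>();
                for (Map.Entry<String, Integer> e : terms.entrySet()) {
                    Integer dim = termSpace.get(e.getKey());
                    if (dim == null) {
                        dim = termSpace.size();   // next free dimension
                        termSpace.put(e.getKey(), dim);
                    }
                    vec.put(dim, e.getValue().floatValue());
                }
                vectors.add(vec);
            }
            return vectors;
        }

        static float cosine(Map<Integer, Float> a, Map<Integer, Float> b) {
            float dot = 0f, normA = 0f, normB = 0f;
            for (Map.Entry<Integer, Float> e : a.entrySet()) {
                float av = e.getValue().floatValue();
                normA += av * av;
                Float bv = b.get(e.getKey());
                if (bv != null) dot += av * bv.floatValue();
            }
            for (Float v : b.values()) normB += v.floatValue() * v.floatValue();
            if (normA == 0f || normB == 0f) return 0f;
            return (float) (dot / Math.sqrt(normA * normB));
        }

        /** Greedy single pass: join the first cluster whose exemplar
         *  is close enough, else start a new one. */
        static List<List<Integer>> cluster(
                List<Map<Integer, Float>> vecs, float threshold) {
            List<List<Integer>> clusters = new ArrayList<List<Integer>>();
            for (int doc = 0; doc < vecs.size(); doc++) {
                boolean placed = false;
                for (List<Integer> c : clusters) {
                    int exemplar = c.get(0).intValue();
                    if (cosine(vecs.get(exemplar), vecs.get(doc)) >= threshold) {
                        c.add(doc);
                        placed = true;
                        break;
                    }
                }
                if (!placed) {
                    List<Integer> c = new ArrayList<Integer>();
                    c.add(doc);
                    clusters.add(c);
                }
            }
            return clusters;
        }
    }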

It is more expensive to calculate similarity based on the entire
document's contents than on just a snippet chosen by the
Highlighter.  However, it's presumably more accurate, and having the
term vectors pre-built at index time should help quite a bit.  As the
number of terms increases, there is presumably a point at which the
cost becomes too great, but it might be a pretty large number of
terms.  Dunno yet.  It might make sense to have a "clusterContent"
field: a truncated copy of "content" which is vectored but neither
stored nor indexed.
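
At index time, that might look roughly like the sketch below.  The
field names and the 10,000-character cutoff are arbitrary, and note
one caveat: as far as I can tell, Lucene won't build term vectors
for a field unless that field is indexed, so the sketch tokenizes
"clusterContent" un-stored rather than leaving it entirely
un-indexed:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.RAMDirectory;

    public class ClusterContentIndexer {

        // Arbitrary cutoff; where the cost/benefit line falls is the
        // open question above.
        static final int CLUSTER_CHARS = 10000;

        static void addDoc(IndexWriter writer, String content)
                throws Exception {
            Document doc = new Document();
            // The normal "content" field, indexed and stored as usual.
            doc.add(new Field("content", content,
                              Field.Store.YES, Field.Index.TOKENIZED));
            // Truncated copy carrying term vectors for clustering.
            // Un-stored; it does get tokenized, since as far as I can
            // tell Lucene won't vector a field that isn't indexed.
            String truncated = content.substring(
                0, Math.min(content.length(), CLUSTER_CHARS));
            doc.add(new Field("clusterContent", truncated,
                              Field.Store.NO, Field.Index.TOKENIZED,
                              Field.TermVector.YES));
            writer.addDocument(doc);
        }

        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                new RAMDirectory(), new StandardAnalyzer(), true);
            addDoc(writer, "some long document body ...");
            writer.close();
        }
    }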

After that, there's also the issue of generating cluster labels.   
Lots of problems to be solved.  But it seems to me that if the term  
vectors are already there, that's an excellent start -- and if you're  
using them for highlighting, you get the disk seeks for free.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



