lucene-java-user mailing list archives

From Dawid Weiss
Subject Re: Removing similar documents from search results
Date Mon, 14 Mar 2005 19:48:41 GMT

I think what they do at Google is a fancy heuristic -- as David Spencer 
mentioned, suburls of a given page, identical snippets, or titles... My 
idea was aimed more at providing a 'realistic overview' of the subjects 
covered by the pages. You could pick, say, the first document from each 
cluster and show those to the user. Within each cluster, documents 
already share some mutual similarity (you would have to calculate it 
yourself, since the clustering algorithm doesn't compute it for all 
pairs of documents), but some pairs are more similar than others. You 
could then hide nearly identical results from the user.
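
To make the idea concrete, here is a rough sketch of hiding near-identical
results inside a cluster. It is only an illustration, not what any
particular clustering library or Lucene itself does: the class name
NearDuplicateFilter, the helper vec, the sparse term-weight maps, the
cluster assignments and the 0.9 threshold are all assumptions made up for
the example. Similarity is taken to be plain cosine similarity, which is
one common choice among several.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NearDuplicateFilter {

    /** Cosine similarity between two sparse term-weight vectors. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            double wa = e.getValue();
            normA += wa * wa;
            Double wb = b.get(e.getKey());
            if (wb != null) {
                dot += wa * wb;
            }
        }
        for (double wb : b.values()) {
            normB += wb * wb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Walks the hits in rank order and keeps a hit unless it is at least
     * 'threshold' similar to an already kept hit from the same cluster.
     * clusterOf[i] is the (hypothetical) cluster id assigned to hit i.
     * Returns the indices of the hits that stay visible.
     */
    static List<Integer> visible(List<Map<String, Double>> docs,
                                 int[] clusterOf, double threshold) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            boolean nearDuplicate = false;
            for (int j : kept) {
                if (clusterOf[i] == clusterOf[j]
                        && cosine(docs.get(i), docs.get(j)) >= threshold) {
                    nearDuplicate = true;
                    break;
                }
            }
            if (!nearDuplicate) {
                kept.add(i);
            }
        }
        return kept;
    }

    /** Hypothetical helper: builds a sparse vector from parallel arrays. */
    static Map<String, Double> vec(String[] terms, double[] weights) {
        Map<String, Double> v = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            v.put(terms[i], weights[i]);
        }
        return v;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> docs = new ArrayList<>();
        docs.add(vec(new String[] {"lucene", "search"}, new double[] {1, 1}));
        docs.add(vec(new String[] {"lucene", "search", "index"},
                     new double[] {1, 1, 0.1}));
        docs.add(vec(new String[] {"lucene", "cluster"}, new double[] {1, 1}));
        docs.add(vec(new String[] {"cooking"}, new double[] {1}));
        int[] clusterOf = {0, 0, 0, 1};
        // Hit 1 is nearly identical to hit 0, so it is hidden.
        System.out.println(visible(docs, clusterOf, 0.9)); // prints [0, 2, 3]
    }
}
```

Only pairs within the same cluster are compared, so the cost depends on
cluster sizes rather than on the full hit list. A threshold on cosine
similarity corresponds to a cone of fixed angle around each kept
document, so raising the threshold narrows the cone.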

Anyway, I think the Google method is just a heuristic based on URLs and 
nothing as fancy as clustering.


Miles Barr wrote:
> Hi Dawid,
> On Mon, 2005-03-14 at 18:55 +0100, Dawid Weiss wrote:
>>I can imagine if you apply clustering to search results anyway then the 
>>information about clusters can help you determine 'similar' results and 
>>reorder the output list.
> That's an interesting idea. How easy is it to 'tighten' the clustering
> cones? So say we take a very narrow cone around each result and any
> other documents within that cone can be considered similar enough, and
> hence not displayed. Then we'd take the document closest to the centre
> of the cloud and make that the 'original' copy and display it.
> Or would that approach be too expensive to calculate for each search?
