lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Spencer <>
Subject Re: Removing similar documents from search results
Date Wed, 16 Mar 2005 05:51:27 GMT
Miles Barr wrote:
> On Mon, 2005-03-14 at 20:48 +0100, Dawid Weiss wrote:
>>I think what they do at Google is a fancy heuristic -- as David Spencer 
>>mentioned, suburls of a given page, identical snippets, or titles... My 
>>idea was more towards providing a 'realistic overview' of subjects in 
>>pages. So you could pick, say, the first document from each cluster and 
>>show them like that to the user. Then, in every cluster documents 
>>already have mutual similarity (this could be calculated manually, the 
>>clustering algorithm doesn't do it for all pairs of documents), but some 
>>have more and some have less. You could then hide nearly identical 
>>results from the user.
>>Anyway, I think the Google method is just a heuristic based on URLs and 
>>nothing as fancy.
> At the moment I need something quite simple. To identify a page that
> appears in many forms, e.g.:
> - Normal version
> - Split across several pages
> - Print version
> - From a different section (different styling and navigation elements)
> Basically identical content, presented in different ways.
> I beginning to think any on the fly filtering would have to be quite
> simple because of the required response time. Something like looking at
> URLs and titles (which are already separate in the index).
> A more permanent solution would have to process the pages ahead of time,
> trying to break it down into chunks, then maybe built up a list of
> hashes of the chunks for each page. Then each page would have a
> 'fingerprint', and hopefully you could come up with a quick way to
> compare them at query time.

I think the paper "Syntactic Clustering of the Web" describes a 
technique similar to what you suggest:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message