lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Spencer <>
Subject Re: Removing similar documents from search results
Date Mon, 14 Mar 2005 18:24:32 GMT
Miles Barr wrote:

> Has anyone tried to remove similar documents from their search results?
> It looks like Google does some on the fly filtering of the results,
> hiding pages which is thinks are too similar, i.e. when you see:
> "In order to show you the most relevant results, we have omitted some
> entries very similar to the 7 already displayed.
> If you like, you can repeat the search with the omitted results
> included."
> at the bottom of the page.
> Is there anything in Lucene or one of the contrib packages that compares
> two documents?

Yes, in theory the "similarity" package in the sandbox can help.
The code generates a query for a source document to find documents that 
are similar to it - the MoreLikeThis class uses the heuristic that 2 
docs are similar if they share "interesting" words. "Interesting" words 
are words that are common in a source doc but not too common in the 
corpus. If you were do do this you'd do something like this:

[1] Do your normal query
[2] As you loop thru the results, for every doc
[2a]	generate a similarity query
[2b]	requery the index for similar docs
[2c]	then, maybe, for every doc from [2b] with a score above some 
threshold, it it's also high up in the results from [2] then "hide" the 
doc a la google et. al.

Could be tricky coding. Another way is to only show 1 doc from any given 
domain. Note that instead of 1 query you'll have "1+n" queries for the 
display of "n" search results.

Similarity links:

Source control:

My weblog entry about the code being checked in:

Javadoc of it that I host:

-- Dave


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message