lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Spencer <dave-lucene-u...@tropo.com>
Subject Re: Removing similar documents from search results
Date Mon, 14 Mar 2005 18:24:32 GMT
Miles Barr wrote:

> Has anyone tried to remove similar documents from their search results?
> It looks like Google does some on the fly filtering of the results,
> hiding pages which is thinks are too similar, i.e. when you see:
> 
> "In order to show you the most relevant results, we have omitted some
> entries very similar to the 7 already displayed.
> If you like, you can repeat the search with the omitted results
> included."
> 
> at the bottom of the page.
> 
> Is there anything in Lucene or one of the contrib packages that compares
> two documents?

Yes, in theory the "similarity" package in the sandbox can help.
The code generates a query for a source document to find documents that 
are similar to it - the MoreLikeThis class uses the heuristic that 2 
docs are similar if they share "interesting" words. "Interesting" words 
are words that are common in a source doc but not too common in the 
corpus. If you were do do this you'd do something like this:

[1] Do your normal query
[2] As you loop thru the results, for every doc
[2a]	generate a similarity query
[2b]	requery the index for similar docs
[2c]	then, maybe, for every doc from [2b] with a score above some 
threshold, it it's also high up in the results from [2] then "hide" the 
doc a la google et. al.

Could be tricky coding. Another way is to only show 1 doc from any given 
domain. Note that instead of 1 query you'll have "1+n" queries for the 
display of "n" search results.




Similarity links:

Source control:

	http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/similarity/

My weblog entry about the code being checked in:

	http://searchmorph.com/weblog/index.php?id=44

Javadoc of it that I host:

	http://searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/MoreLikeThis.html


-- Dave


> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message