Miles Barr wrote:
> Has anyone tried to remove similar documents from their search results?
> It looks like Google does some on the fly filtering of the results,
> hiding pages which is thinks are too similar, i.e. when you see:
>
> "In order to show you the most relevant results, we have omitted some
> entries very similar to the 7 already displayed.
> If you like, you can repeat the search with the omitted results
> included."
>
> at the bottom of the page.
>
> Is there anything in Lucene or one of the contrib packages that compares
> two documents?
Yes, in theory the "similarity" package in the sandbox can help.
The code generates a query for a source document to find documents that
are similar to it - the MoreLikeThis class uses the heuristic that 2
docs are similar if they share "interesting" words. "Interesting" words
are words that are common in a source doc but not too common in the
corpus. If you were do do this you'd do something like this:
[1] Do your normal query
[2] As you loop thru the results, for every doc
[2a] generate a similarity query
[2b] requery the index for similar docs
[2c] then, maybe, for every doc from [2b] with a score above some
threshold, it it's also high up in the results from [2] then "hide" the
doc a la google et. al.
Could be tricky coding. Another way is to only show 1 doc from any given
domain. Note that instead of 1 query you'll have "1+n" queries for the
display of "n" search results.
Similarity links:
Source control:
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/similarity/
My weblog entry about the code being checked in:
http://searchmorph.com/weblog/index.php?id=44
Javadoc of it that I host:
http://searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/MoreLikeThis.html
-- Dave
>
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
|