lucene-java-user mailing list archives

From Miles Barr <>
Subject Re: Removing similar documents from search results
Date Tue, 15 Mar 2005 10:34:08 GMT
On Mon, 2005-03-14 at 10:24 -0800, David Spencer wrote:
> Yes, in theory the "similarity" package in the sandbox can help.
> The code generates a query for a source document to find documents that 
> are similar to it - the MoreLikeThis class uses the heuristic that 2 
> docs are similar if they share "interesting" words. "Interesting" words 
> are words that are common in a source doc but not too common in the 
> corpus. If you were to do this you'd do something like this:
> [1] Do your normal query
> [2] As you loop thru the results, for every doc
> [2a]	generate a similarity query
> [2b]	requery the index for similar docs
> [2c]	then, maybe, for every doc from [2b] with a score above some 
> threshold, if it's also high up in the results from [2] then "hide" the 
> doc a la google et al.
> Could be tricky coding. Another way is to only show 1 doc from any given 
> domain. Note that instead of 1 query you'll have "1+n" queries for the 
> display of "n" search results.

That sounds like an interesting approach. But I'll probably wait until
Chuck's patch is included. I'm also a bit worried about the performance
of this approach. It might add too much time to each query.
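
For reference, here is a rough sketch of steps [1] to [2c] as I understand them. It is a toy in-memory version in Python, not the sandbox MoreLikeThis code and not Lucene at all: the function names (tokenize, interesting_terms, hide_similar), the tf*idf-style scoring, the 0.6 threshold and the choice to compute document frequencies over the result set rather than the whole index are all my own assumptions for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real analyzer would do much more.
    return text.lower().split()


def interesting_terms(tokens, doc_freq, num_docs, max_terms=5):
    # "Interesting" = frequent in this doc, not too common across docs.
    # Scored here with a simple tf * log(N / df); this is an assumption,
    # not the scoring MoreLikeThis actually uses.
    tf = Counter(tokens)
    scores = {t: tf[t] * math.log(num_docs / doc_freq[t]) for t in tf}
    ranked = sorted((t for t in scores if scores[t] > 0),
                    key=lambda t: (-scores[t], t))  # ties broken by term
    return set(ranked[:max_terms])


def hide_similar(ranked_ids, docs, threshold=0.6):
    # ranked_ids plays the role of the hits from the normal query [1].
    tokens = {i: tokenize(docs[i]) for i in ranked_ids}
    doc_freq = Counter(t for toks in tokens.values() for t in set(toks))
    shown, hidden = [], set()
    for doc_id in ranked_ids:                       # [2] loop over results
        if doc_id in hidden:
            continue
        shown.append(doc_id)
        # [2a] build the "similarity query" for this doc
        terms = interesting_terms(tokens[doc_id], doc_freq, len(tokens))
        if not terms:
            continue
        # [2b] "requery": score the remaining results against those terms
        for other in ranked_ids:
            if other in hidden or other in shown:
                continue
            overlap = len(terms & set(tokens[other])) / len(terms)
            if overlap >= threshold:                # [2c] hide near-duplicates
                hidden.add(other)
    return shown
```

Every doc that ends up displayed triggers one extra scoring pass here, which is the "1+n" cost mentioned above; against a real index each of those passes would be a full query.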

Miles Barr <>
Runtime Collective Ltd.
