lucene-java-user mailing list archives

From Miles Barr <mi...@runtime-collective.com>
Subject Re: Removing similar documents from search results
Date Tue, 15 Mar 2005 10:34:08 GMT
On Mon, 2005-03-14 at 10:24 -0800, David Spencer wrote:
> Yes, in theory the "similarity" package in the sandbox can help.
> The code generates a query for a source document to find documents that 
> are similar to it - the MoreLikeThis class uses the heuristic that 2 
> docs are similar if they share "interesting" words. "Interesting" words 
> are words that are common in a source doc but not too common in the 
> corpus. If you were to do this, you'd do something like this:
> 
> [1] Do your normal query
> [2] As you loop thru the results, for every doc
> [2a]	generate a similarity query
> [2b]	requery the index for similar docs
> [2c]	then, maybe, for every doc from [2b] with a score above some 
> threshold, if it's also high up in the results from [2] then "hide" the 
> doc a la Google et al.
> 
> Could be tricky coding. Another way is to only show 1 doc from any given 
> domain. Note that instead of 1 query you'll have "1+n" queries for the 
> display of "n" search results.

That sounds like an interesting approach. But I'll probably wait until
Chuck's patch is included. I'm also a bit worried about the performance
of this approach. It might add too much time to each query.
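
For reference, the loop David describes in steps [1] to [2c] could be
sketched roughly as below. This is only a minimal illustration under
stated assumptions, not the sandbox code: the class name
SimilarResultFilter, the method hideSimilar and the similarTo callback
are all hypothetical. The callback stands in for steps [2a] and [2b]
(building a similarity query and requerying the index), which a real
implementation would delegate to something like MoreLikeThis. No Lucene
calls are made here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class SimilarResultFilter {

    /**
     * Hypothetical sketch of the "hide similar docs" loop.
     *
     * @param rankedIds doc ids returned by the normal query [1], best first
     * @param similarTo assumed stand-in for [2a]+[2b]: given a doc id,
     *                  returns similar doc ids mapped to a similarity score
     * @param threshold scores strictly above this count as "similar" [2c]
     * @return the doc ids to display, in their original rank order
     */
    public static List<String> hideSimilar(
            List<String> rankedIds,
            Function<String, Map<String, Double>> similarTo,
            double threshold) {
        Set<String> inResults = new HashSet<>(rankedIds);
        Set<String> hidden = new HashSet<>();
        List<String> shown = new ArrayList<>();

        for (String id : rankedIds) {
            if (hidden.contains(id)) {
                continue; // already covered by a higher-ranked similar doc
            }
            shown.add(id);

            // One extra lookup per displayed doc: the "1+n" query cost.
            for (Map.Entry<String, Double> e : similarTo.apply(id).entrySet()) {
                String other = e.getKey();
                boolean similarEnough = e.getValue() > threshold;
                if (!other.equals(id) && similarEnough && inResults.contains(other)) {
                    hidden.add(other); // no effect if 'other' was already shown
                }
            }
        }
        return shown;
    }
}
```

Each displayed doc triggers one similarity lookup, so the cost grows with
the number of results shown, which is the performance concern above.
Docs that are already hidden are skipped and cost no extra lookup.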



-- 
Miles Barr <miles@runtime-collective.com>
Runtime Collective Ltd.

