lucene-java-user mailing list archives

From David Spencer <dave-lucene-u...@tropo.com>
Subject Re: Removing similar documents from search results
Date Mon, 14 Mar 2005 19:44:58 GMT
Otis Gospodnetic wrote:

> The problem with 2c is that scores are currently relative, and not
> absolute.  I am hoping Chuck's patch makes it into the source, as
> making scores absolute would be helpful in situations like this one.

Good point.

If the original MoreLikeThis query allows the source doc to be returned,
its score might be used to normalize the other scores, however...
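
A rough sketch of that normalization idea, in plain Java with no Lucene calls. ScoreNormalizer and its normalize method are made-up names for illustration, and the sketch assumes the source doc really does show up in the results of its own similarity query. Each raw score is divided by the source doc's own score, so the source doc ends up at 1.0 and a fixed threshold (say 0.7) means the same thing from one query to the next:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: rescales the raw scores of one
// similarity query so that the source document's own score becomes 1.0.
class ScoreNormalizer {

    /**
     * @param rawScores   docId -> raw score from one similarity query
     * @param sourceDocId id of the doc the similarity query was built from
     * @return docId -> score relative to the source doc's self-match
     */
    static Map<Integer, Float> normalize(Map<Integer, Float> rawScores, int sourceDocId) {
        Float selfScore = rawScores.get(sourceDocId);
        if (selfScore == null || selfScore <= 0f) {
            // Assumption violated: the source doc was not returned by its own query.
            throw new IllegalArgumentException("source doc missing from results; cannot normalize");
        }
        Map<Integer, Float> normalized = new LinkedHashMap<>();
        for (Map.Entry<Integer, Float> e : rawScores.entrySet()) {
            normalized.put(e.getKey(), e.getValue() / selfScore);
        }
        return normalized;
    }

    public static void main(String[] args) {
        Map<Integer, Float> raw = new LinkedHashMap<>();
        raw.put(7, 0.8f);   // the source doc itself
        raw.put(12, 0.6f);  // a fairly similar doc
        raw.put(31, 0.2f);  // a weakly similar doc
        System.out.println(normalize(raw, 7));
    }
}
```

This only helps if the source doc's self-match is a sensible upper bound for the query, which seems likely but is not guaranteed.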

> 
> Otis
> 
> 
> --- David Spencer <dave-lucene-user@tropo.com> wrote:
> 
>>Miles Barr wrote:
>>
>>>Has anyone tried to remove similar documents from their search
>>>results?
>>>
>>>It looks like Google does some on-the-fly filtering of the results,
>>>hiding pages which it thinks are too similar, i.e. when you see:
>>>
>>>"In order to show you the most relevant results, we have omitted
>>>some entries very similar to the 7 already displayed.
>>>If you like, you can repeat the search with the omitted results
>>>included."
>>>
>>>at the bottom of the page.
>>>
>>>Is there anything in Lucene or one of the contrib packages that
>>>compares two documents?
>>
>>Yes, in theory the "similarity" package in the sandbox can help.
>>The code generates a query for a source document to find documents
>>that are similar to it - the MoreLikeThis class uses the heuristic
>>that 2 docs are similar if they share "interesting" words.
>>"Interesting" words are words that are common in a source doc but
>>not too common in the corpus. If you were to do this you'd do
>>something like this:
>>
>>[1] Do your normal query
>>[2] As you loop thru the results, for every doc
>>[2a]	generate a similarity query
>>[2b]	requery the index for similar docs
>>[2c]	then, maybe, for every doc from [2b] with a score above some
>>	threshold, if it's also high up in the results from [2] then
>>	"hide" the doc a la Google et al.
>>
>>Could be tricky coding. Another way is to only show 1 doc from any
>>given domain. Note that instead of 1 query you'll have "1+n" queries
>>for the display of "n" search results.
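
The steps above can be sketched roughly as follows. This is an in-memory stand-in and does not call Lucene or the sandbox MoreLikeThis class: the NearDuplicateFilter class, its method names, and the minTermFreq / maxDocFreq / threshold parameters are all made up for illustration. "Interesting" words are approximated as terms repeated in the source doc but found in few docs overall, and the "requery" is just the fraction of those terms a candidate doc contains:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for steps [1]-[2c]; NOT the Lucene API.
class NearDuplicateFilter {
    final List<List<String>> corpus;                       // docId = list index
    final Map<String, Integer> docFreq = new HashMap<>();  // term -> number of docs containing it

    NearDuplicateFilter(List<List<String>> corpus) {
        this.corpus = corpus;
        for (List<String> doc : corpus) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
    }

    // [2a] "Interesting" terms: frequent in this doc, not too common in the corpus.
    Set<String> interestingTerms(int docId, int minTermFreq, int maxDocFreq) {
        Map<String, Integer> termFreq = new HashMap<>();
        for (String term : corpus.get(docId)) {
            termFreq.merge(term, 1, Integer::sum);
        }
        Set<String> terms = new HashSet<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            if (e.getValue() >= minTermFreq && docFreq.get(e.getKey()) <= maxDocFreq) {
                terms.add(e.getKey());
            }
        }
        return terms;
    }

    // [2b] Stand-in for the requery: share of the query terms found in the candidate.
    double similarity(Set<String> queryTerms, int candidateId) {
        if (queryTerms.isEmpty()) return 0.0;
        Set<String> candidate = new HashSet<>(corpus.get(candidateId));
        int shared = 0;
        for (String term : queryTerms) {
            if (candidate.contains(term)) shared++;
        }
        return (double) shared / queryTerms.size();
    }

    // [2] + [2c] Walk the ranked hits; hide later hits too similar to one already shown.
    List<Integer> filter(List<Integer> rankedHits, int minTermFreq, int maxDocFreq, double threshold) {
        List<Integer> shown = new ArrayList<>();
        Set<Integer> hidden = new HashSet<>();
        for (int docId : rankedHits) {
            if (hidden.contains(docId)) continue;
            shown.add(docId);
            Set<String> query = interestingTerms(docId, minTermFreq, maxDocFreq);
            for (int other : rankedHits) {   // one extra pass per shown hit: the "1+n" cost
                if (other != docId && !shown.contains(other)
                        && similarity(query, other) >= threshold) {
                    hidden.add(other);
                }
            }
        }
        return shown;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("lucene index lucene search index tutorial".split(" ")),
            Arrays.asList("lucene index lucene search index guide".split(" ")),
            Arrays.asList("cooking pasta pasta sauce sauce recipe".split(" ")));
        NearDuplicateFilter f = new NearDuplicateFilter(corpus);
        System.out.println(f.filter(Arrays.asList(0, 1, 2), 2, 2, 0.8));
    }
}
```

With a real index, [2a] and [2b] would be a generated similarity query and a second search, and the threshold would have to be chosen against whatever scores that search returns.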
>>
>>
>>
>>
>>Similarity links:
>>
>>Source control:
>>
>>	http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/similarity/
>>
>>My weblog entry about the code being checked in:
>>
>>	http://searchmorph.com/weblog/index.php?id=44
>>
>>Javadoc of it that I host:
>>
>>	http://searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/MoreLikeThis.html
>>
>>-- Dave
>>
>>
>>
>>>
>>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 
> 
> 



