lucene-java-user mailing list archives

From Erik Hatcher <>
Subject Re: Re[2]: md5 keyword field issue
Date Mon, 20 Jun 2005 14:48:30 GMT

On Jun 20, 2005, at 9:38 AM, wrote:

> Monday, June 20, 2005, 3:55:36 PM, Erik Hatcher wrote:
>> Filters reduce the search space to a subset of the documents in the
>> index.  Which document would you want returned when there are
>> multiple documents in the index with the same MD5?  Or do you want to
>> cluster them by MD5?
> i think cluster by md5 is more appropriate.
>> Do you want to cluster them by MD5 perhaps, but still return multiple
>> documents back from a search?
> i want to return just the 1st image (the more relevant one). no use to
> show duplicates in an image search app.

Now you've just stated the same conflict a different way: you want to
cluster, but return only one.  :)

If you only want one image returned, then indexing each image only once
seems the way to go.  When you find a duplicate MD5, don't index it as a
second document.  Instead, update the existing document by adding the
additional ALT text and perhaps the additional URL.
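
A minimal sketch of that merge step, in plain Java with no Lucene calls.
The names ImageEntry, ImageDeduplicator, add and combinedAlt are
hypothetical, made up for illustration. It assumes the MD5 has already
been computed for each crawled image; each resulting entry would then be
indexed as a single document.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical holder for everything known about one unique image.
class ImageEntry {
    final String md5;
    final Set<String> urls = new LinkedHashSet<>();
    final Set<String> altTexts = new LinkedHashSet<>();

    ImageEntry(String md5) {
        this.md5 = md5;
    }

    // All ALT texts joined into one string, e.g. for a single tokenized field.
    String combinedAlt() {
        return String.join(" ", altTexts);
    }
}

// Hypothetical helper: collapses crawled images to one entry per MD5.
class ImageDeduplicator {
    // Insertion-ordered so entries come back in first-seen order.
    private final Map<String, ImageEntry> byMd5 = new LinkedHashMap<>();

    // First sighting of an MD5 creates the entry; later sightings only
    // contribute their URL and ALT text to the existing entry.
    void add(String md5, String url, String alt) {
        ImageEntry entry = byMd5.computeIfAbsent(md5, ImageEntry::new);
        entry.urls.add(url);
        if (alt != null && !alt.isEmpty()) {
            entry.altTexts.add(alt);
        }
    }

    // One entry per unique image; each would become one indexed document.
    Collection<ImageEntry> entries() {
        return byMd5.values();
    }
}
```

With one document per MD5, a search can never return the same image
twice, and the combined ALT text from every copy still contributes to
matching.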

Is there a reason why indexing each unique image (by MD5) is not a  
good way to go in your case?

> in sql this would be:
> select distinct md5, url, alt from table group by md5 order by  
> score asc;

This would give you multiple records for the same MD5.  You said  
above you only want one per MD5.

