lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Re[2]: md5 keyword field issue
Date Mon, 20 Jun 2005 14:48:30 GMT

On Jun 20, 2005, at 9:38 AM, catalin-lucene@dazoot.ro wrote:

> Monday, June 20, 2005, 3:55:36 PM, Erik Hatcher wrote:
>
>> Filters reduce the search space to a subset of the documents in the
>> index.  Which document would you want returned when there are
>> multiple documents in the index with the same MD5?  Or do you want to
>> cluster them by MD5?
>>
>
> i think cluster by md5 is more appropriate.
>
>
>> Do you want to cluster them by MD5 perhaps, but still return multiple
>> documents back from a search?
>>
>
> i want to return just the 1st image (the more relevant one). no use to
> show duplicates in an image search app.

Now you've just said the same conflicting thing a different way.  You  
want to cluster but only return one.  :)

If you only want one image returned, then it seems that only indexing  
the same image once is the way to go.  When you find a duplicate MD5,  
don't index that as a second document.  You will, instead, update the  
document by adding additional ALT text and perhaps the additional URL.

Is there a reason why indexing each unique image (by MD5) is not a  
good way to go in your case?

> in sql this would be:
> select distinct md5, url, alt from table group by md5 order by  
> score asc;

This would give you multiple records for the same MD5.  You said  
above you only want one per MD5.

     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message