lucene-java-user mailing list archives

From Erik Hatcher
Subject Re: Re[4]: md5 keyword field issue
Date Mon, 20 Jun 2005 16:32:04 GMT

On Jun 20, 2005, at 10:54 AM, wrote:
> Monday, June 20, 2005, 5:48:30 PM, Erik Hatcher wrote:
>> Now you've just said the same conflicting thing a different way.  You
>> want to cluster but only return one.  :)
> i think i misunderstood the term "cluster" here.
> so yes, i just want one image returned.

Maybe my interpretation of "cluster" is clouded by the search  
domain.  In the search domain, cluster means grouping multiple things.

>> If you only want one image returned, then it seems that only indexing
>> the same image once is the way to go.  When you find a duplicate MD5,
>> don't index that as a second document.  You will, instead, update the
>> document by adding additional ALT text and perhaps the additional  
>> URL.
> this sounds pretty ok !

The tricks are to do a search when indexing to find duplicates, and
to "update" the document by deleting and re-adding it (you'll
probably want to store the field data so you can retrieve it easily
and use it for the new updated document).

The negative to this approach is that you won't know specifically which
page the image was on in results, though you could keep all URLs
that point to it, since a document can have multiple fields named "URL",
for example.
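
A rough sketch of that search / delete / re-add flow, in case it helps.
Note the assumptions: a plain in-memory map stands in for the index, so
none of this is Lucene API, and ImageIndexSketch, ImageDoc, addImage and
find are made-up names used only for illustration. The MD5 values are
placeholder strings, not real digests.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an index that holds one document per image MD5.
// A map keyed by MD5 plays the role of the "search for a duplicate" step.
class ImageIndexSketch {

    // A "document" with a single md5 value and multi-valued URL / ALT fields.
    static final class ImageDoc {
        final String md5;
        final Set<String> urls = new LinkedHashSet<>();
        final Set<String> altTexts = new LinkedHashSet<>();

        ImageDoc(String md5) {
            this.md5 = md5;
        }
    }

    private final Map<String, ImageDoc> docsByMd5 = new LinkedHashMap<>();

    // Index an image occurrence; a duplicate MD5 updates the existing document.
    void addImage(String md5, String url, String alt) {
        // 1. Search for an existing document with the same MD5.
        ImageDoc existing = docsByMd5.get(md5);

        // 2. Build the replacement document, copying over the stored fields.
        ImageDoc updated = new ImageDoc(md5);
        if (existing != null) {
            updated.urls.addAll(existing.urls);
            updated.altTexts.addAll(existing.altTexts);
            // 3. "Update" means delete the old document first...
            docsByMd5.remove(md5);
        }
        updated.urls.add(url);
        if (alt != null && !alt.isEmpty()) {
            updated.altTexts.add(alt);
        }

        // 4. ...and then re-add the merged document.
        docsByMd5.put(md5, updated);
    }

    // Look up the single document for an MD5, or null if none is indexed.
    ImageDoc find(String md5) {
        return docsByMd5.get(md5);
    }

    // Number of documents, i.e. number of distinct images.
    int size() {
        return docsByMd5.size();
    }
}
```

Adding the same MD5 twice with different pages leaves one document that
carries both URLs and both ALT texts, which is the trade-off described
above: one hit per image, but no single "the" page for it.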

>>> in sql this would be:
>>> select distinct md5, url, alt from table group by md5 order by
>>> score asc;
>> This would give you multiple records for the same MD5.  You said
>> above you only want one per MD5.
> here i'm afraid you are not correct, because i have GROUP BY MD5
> clause which will return no duplicates.

Sorry, I missed the GROUP BY clause there in my first human parse of  
the expression - I was too busy focusing on DISTINCT.

