lucene-java-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Checking for duplicates inside index
Date Wed, 24 May 2006 07:59:23 GMT
Hannes Carl Meyer wrote:
> Ken Krugler wrote:
>>> On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:
>>>>
>>>>  I'm indexing ~10000 documents per day, but since I'm getting a lot of
>>>>  real duplicates (100% the same document content), I want to check the
>>>>  content before indexing...
>>>>
>>>>  My idea is to create a checksum of the document's content and store it
>>>>  with the document inside the index. Before indexing a new document, I
>>>>  will compare the new document's checksum with the ones in the index.
>>>>
>>>>  Is that a good idea? Does anyone have experience with that method?
>>>>  Are any tools available?
>>>
>>> That could work.
>>>
>>> You will need a big sum though. MD5?
>>
>> Just as a reference, Nutch uses an MD5 digest to detect duplicate web 
>> pages. It works fine, except of course when two docs differ by only 
>> an insignificant text delta. There's some recent work in this area - 
>> check out TextProfileSignature.
>>
>> -- Ken
> Hi,
>
> thank you very much - I'm currently checking the possibilities and I 
> found an interesting "hash algorithm" called nilsimsa 
> http://ixazon.dynip.com/~cmeclax/nilsimsa.html (still searching for a 
> java implementation).

Please read the archives of the nilsimsa mailing list; the algorithm was
found to be problematic.
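
For reference, the exact-duplicate check discussed earlier in the thread (an MD5 digest of the content, compared before indexing) could be sketched roughly as follows. This is only an illustration, not code from Lucene or Nutch: the class and method names (DuplicateChecker, md5Hex, markIfNew) are hypothetical, and the in-memory Set is an assumption standing in for a lookup against a digest field stored in the index. Like any plain checksum, it only catches documents whose content is byte-for-byte identical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene or Nutch.
class DuplicateChecker {
    // Digests of content already accepted. Assumption: in a real setup this
    // lookup would be done against a digest field stored in the index.
    private final Set<String> seen = new HashSet<>();

    // Returns the MD5 digest of the content as 32 lowercase hex characters.
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is not expected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns true the first time a given content is offered and records its
    // digest; returns false for an exact repeat.
    boolean markIfNew(String content) {
        return seen.add(md5Hex(content));
    }
}
```

A caller would index a document only when markIfNew(content) returns true. Two documents that differ by a single character produce different digests, which is the limitation Ken describes above.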

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




