lucene-java-user mailing list archives

From Hannes Carl Meyer <develop...@rc.ag>
Subject Re: Checking for duplicates inside index
Date Wed, 24 May 2006 07:42:08 GMT
Ken Krugler wrote:
>> On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:
>>
>>> I'm indexing ~10000 documents per day, but since I'm getting a lot of
>>> real duplicates (100% the same document content) I want to check the
>>> content before indexing...
>>>
>>> My idea is to create a checksum of the document's content and store it
>>> within the document inside the index; before indexing a new document I
>>> will compare the new document's checksum with the ones in the index.
>>>
>>> Is that a good idea? Does someone have experience with that method?
>>> Any tools available?
>>
>> That could work.
>>
>> You will need a big sum though. MD5?
>
> Just as a reference, Nutch uses an MD5 digest to detect duplicate web 
> pages. It works fine, except of course when two docs differ by only an 
> insignificant text delta. There's some recent work in this area - 
> check out TextProfileSignature.
>
> -- Ken
Hi,

thank you very much. I'm currently checking the possibilities and found an
interesting hash algorithm called Nilsimsa
(http://ixazon.dynip.com/~cmeclax/nilsimsa.html); I'm still searching for a
Java implementation.
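
Just to make the idea concrete, here is a rough sketch of the exact-digest
check discussed above, assuming the Lucene 1.9/2.0 API and an untokenized
"digest" field; the class, method, and field names are only illustrative,
not anything from Nutch:

    import java.security.MessageDigest;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class DedupIndexer {

        /** Hex-encoded MD5 digest of the raw document content. */
        static String md5Hex(String content) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(content.getBytes("UTF-8"));
            StringBuffer sb = new StringBuffer();
            for (int i = 0; i < digest.length; i++) {
                String h = Integer.toHexString(digest[i] & 0xff);
                if (h.length() == 1) sb.append('0');
                sb.append(h);
            }
            return sb.toString();
        }

        /** Adds the document only if no doc with the same digest is indexed. */
        static void addIfNew(IndexWriter writer, IndexSearcher searcher,
                             String content) throws Exception {
            String digest = md5Hex(content);
            Hits hits = searcher.search(new TermQuery(new Term("digest", digest)));
            if (hits.length() > 0) {
                return; // exact duplicate already in the index
            }
            Document doc = new Document();
            doc.add(new Field("digest", digest,
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("content", content,
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
    }

Note that the searcher only sees documents that were already committed, so
duplicates arriving within the same batch would still need an in-memory set
of digests. And, as Ken pointed out, a single changed character defeats an
exact MD5 match, which is why Nilsimsa / TextProfileSignature look interesting.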

H.

