lucene-java-user mailing list archives

From karl wettin <karl.wet...@gmail.com>
Subject Re: near duplicates
Date Wed, 18 Oct 2006 17:37:32 GMT

On 17 Oct 2006, at 18:55, Andrzej Bialecki wrote:
> You need to create a fuzzy signature of the document, based on a term  
> histogram or shingles - take a look at the Signature framework in  
> Nutch.
>
> There is a substantial literature on this subject - go to Citeseer  
> and run a search for "near duplicate detection".

Interesting. I'll have to check this out a bit more some day(tm).
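
For the archive, here is a rough sketch of the shingle idea mentioned in the quoted reply: break each document into overlapping runs of k words, then compare the two sets with Jaccard similarity. This is plain Java and is not the Nutch Signature framework; the class name ShingleSignature, the methods shingles and jaccard, and the tokenization rule are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Nutch.
public class ShingleSignature {

    // Lowercase the text, split on anything that is not a letter or digit,
    // and collect every run of k consecutive tokens as one shingle.
    static Set<String> shingles(String text, int k) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + k)));
        }
        return result;
    }

    // Jaccard similarity: |A intersect B| / |A union B|.
    // Two empty sets are treated as identical (an assumption of this sketch).
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```

Two documents whose score exceeds some threshold would be treated as near duplicates; the threshold and the shingle width k both need tuning per corpus. Comparing full shingle sets pairwise does not scale to a large index, which is why the literature usually reduces each set to a compact signature first.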

