On 17 Oct 2006, at 18:55, Andrzej Bialecki wrote:
> You need to create a fuzzy signature of the document, based on a term
> histogram or shingles - take a look at the Signature framework in
> Nutch.
>
> There is a substantial literature on this subject - go to Citeseer
> and run a search for "near duplicate detection".
Interesting. I'll have to check this out a bit more some day(tm).
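
For anyone reading along, here is a rough sketch of the shingle idea
mentioned above. It is only an illustration under my own assumptions,
not the Nutch Signature framework: the class and method names
(ShingleSketch, shingles, jaccard) are made up, and the shingle size of
2 words is an arbitrary choice. It compares two texts by the overlap
(Jaccard similarity) of their sets of k-word shingles.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Nutch.
public class ShingleSketch {

    // Return the set of k-word shingles (overlapping word windows) of a text.
    static Set<String> shingles(String text, int k) {
        // Lower-case and split on anything that is not a letter or digit.
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= words.size(); i++) {
            result.add(String.join(" ", words.subList(i, i + k)));
        }
        return result;
    }

    // Jaccard similarity: |A intersect B| / |A union B|, in the range 0..1.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty documents are treated as identical
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return (double) common.size() / union;
    }

    public static void main(String[] args) {
        Set<String> a = shingles("The quick brown fox jumps", 2);
        Set<String> b = shingles("The quick brown fox leaps", 2);
        System.out.println("similarity = " + jaccard(a, b));
    }
}
```

Documents whose similarity is above some threshold would be treated as
near duplicates; what threshold works is something to tune on real data.
Comparing every pair of documents this way does not scale, which is
where a compact signature per document comes in.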
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org