lucene-java-user mailing list archives

From "Chong, Herb" <>
Subject RE: Filtering out duplicate documents...
Date Mon, 08 Mar 2004 21:42:39 GMT
That kind of fuzzy equality is an area of open research. You need to define an acceptable
rate of Type I (false positive) and Type II (false negative) errors before you can think about
implementations that scale better. Approaches range from comparing document vocabulary and
term statistics to raw hashing of the input text.
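As a rough illustration of the hashing end of that spectrum, one common technique is shingling: hash each overlapping run of k words in a document into a set, then compare two documents' sets with Jaccard similarity. The class and parameter choices below (4-word shingles, `String.hashCode` for hashing) are my own assumptions for the sketch, not anything Lucene provides out of the box.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of shingle-based near-duplicate detection (assumed parameters:
// 4-word shingles, plain String.hashCode as the hash).
public class ShingleOverlap {

    // Hash every overlapping k-word window of the text into a set.
    static Set<Integer> shingles(String text, int k) {
        String[] words = text.toLowerCase().split("\\W+");
        Set<Integer> out = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            int h = 0;
            for (int j = 0; j < k; j++) {
                h = 31 * h + words[i + j].hashCode();
            }
            out.add(h);
        }
        return out;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|; near-duplicates score close to 1.
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        String article = "the quick brown fox jumps over the lazy dog near the river bank";
        String printer = "the quick brown fox jumps over the lazy dog near the river bank printer friendly";
        String other   = "completely different text about search engines and inverted index structures";
        System.out.println(jaccard(shingles(article, 4), shingles(printer, 4)));
        System.out.println(jaccard(shingles(article, 4), shingles(other, 4)));
    }
}
```

An article and its printer-friendly variant share almost all shingles, so they score near 1.0; unrelated texts score near 0. Building the sets is linear in document length, so this scales far better than pairwise edit distance, though comparing all pairs still needs an index or MinHash-style sketching to be practical.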


-----Original Message-----
From: Michael Giles []
Sent: Monday, March 08, 2004 4:38 PM
To: Lucene Users List
Subject: Filtering out duplicate documents...

Obviously you can compute the Levenshtein distance on the text, but that is
far too computationally intensive to scale.  So the goal is to find
something that would be workable in a production system.  For example, a
given NYT article and its printer-friendly version should be deemed to be
the same.
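To make the scaling objection concrete, here is the standard dynamic-programming Levenshtein algorithm (not code from this thread, just the textbook two-row formulation): it costs O(n*m) time per pair of texts, which is why applying it across every document pair in a large collection is impractical.

```java
// Textbook Levenshtein edit distance, two-row dynamic programming.
// O(n*m) time per comparison -- fine for short strings, prohibitive
// when run over every pair of full documents in a collection.
public class Levenshtein {

    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```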
