lucene-java-user mailing list archives

From Michael Giles
Subject Filtering out duplicate documents...
Date Mon, 08 Mar 2004 21:37:47 GMT
I'm looking for a way to filter out duplicate documents from an index 
(either while indexing or after the fact).  It seems like there should be 
an approach based on comparing the terms of two documents, but I'm wondering 
whether anyone else (e.g. the Nutch folks) has come up with a solution to 
this problem.

Obviously you can compute the Levenshtein distance on the text, but that is 
far too computationally intensive to scale (it is quadratic in the text 
length for every pair of documents).  So the goal is to find something that 
would be workable in a production system.  For example, a given NYT article 
and its printer-friendly version should be deemed to be the same.
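
To make the term-comparison idea concrete, here is a minimal sketch, assuming 
that "comparing the terms" means comparing sets of overlapping word runs 
(shingles) with Jaccard similarity.  This swaps in a generic technique; it is 
not something Lucene or Nutch is known to provide in this form.  The class 
name ShingleDedup, the shingle size and the threshold are all hypothetical 
choices for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Nutch.
class ShingleDedup {

    // Split text into lower-cased words and return every run of k
    // consecutive words ("shingle") as a set.
    // Note: \W+ treats only ASCII letters, digits and '_' as word characters.
    static Set<String> shingles(String text, int k) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= words.size(); i++) {
            result.add(String.join(" ", words.subList(i, i + k)));
        }
        return result;
    }

    // Jaccard similarity: |A intersect B| / |A union B|, in [0, 1].
    // Two empty sets are treated as identical (assumption).
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return (double) common.size() / union;
    }

    // Treat two texts as the same document when their shingle sets
    // overlap by at least the given threshold.
    static boolean nearDuplicate(String x, String y, int k, double threshold) {
        return jaccard(shingles(x, k), shingles(y, k)) >= threshold;
    }
}
```

With 3-word shingles and a threshold around 0.7, an article and a copy that 
only adds a short "Print this page" header would count as the same, while 
unrelated texts score 0.  Comparing every pair this way is still quadratic in 
the number of documents, so a production system would presumably need to 
reduce each shingle set to a compact fingerprint first.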

