lucene-java-user mailing list archives

From "Find Me" <findm...@gmail.com>
Subject near duplicates
Date Tue, 17 Oct 2006 15:54:07 GMT
How can I eliminate near duplicates from the index? Someone suggested that I
could look at the TermVectors and do a comparison to remove the duplicates.
One major problem with this is that the structure of the document is no longer
taken into account. Are there any obvious pitfalls? For example: Document A
being a subset of Document B, but with its terms in no particular order.
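
A rough sketch of the kind of comparison suggested, in plain Java without
Lucene. Everything here is an assumption for illustration: the class name
TermVectorCompare, the helper methods, and the naive tokenization on
non-word characters all stand in for whatever the real TermVectors would
provide. It compares two term-frequency maps by cosine similarity and also
measures how much of one document's terms are contained in the other.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Lucene or Nutch.
class TermVectorCompare {

    // Build a term -> frequency map. Lowercasing and splitting on
    // non-word characters is a stand-in for a real analyzer.
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                tf.merge(term, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine similarity of two term-frequency maps, in [0, 1].
    // Only counts are used, so word order has no effect on the score.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int fa = e.getValue();
            normA += (double) fa * fa;
            Integer fb = b.get(e.getKey());
            if (fb != null) {
                dot += (double) fa * fb;
            }
        }
        for (int fb : b.values()) {
            normB += (double) fb * fb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Fraction of a's term occurrences that also occur in b.
    // A value of 1.0 means a's terms are a subset (as a bag) of b's.
    static double containment(Map<String, Integer> a, Map<String, Integer> b) {
        int total = 0, covered = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            total += e.getValue();
            covered += Math.min(e.getValue(), b.getOrDefault(e.getKey(), 0));
        }
        return total == 0 ? 0.0 : (double) covered / total;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = termFreq("brown fox");
        Map<String, Integer> b = termFreq("the quick brown fox jumps");
        System.out.println("cosine(A, B)      = " + cosine(a, b));
        System.out.println("containment(A, B) = " + containment(a, b));
    }
}
```

With this sketch, two texts containing the same words in a different order
score a cosine of 1.0, which is exactly the loss of structure mentioned
above. The subset case shows up differently: cosine drops because B has
extra terms, while containment of A in B stays at 1.0, so a single
threshold on one measure would probably not catch both situations.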

Nutch's DeleteDuplicates class is useful only when the documents are
identical with respect to either URL or the content.
