lucene-java-user mailing list archives

From Rich Heimann <>
Subject Duplicate documents in a corpus
Date Thu, 28 Jul 2011 15:49:51 GMT

I am curious whether Lucene and/or Mahout can identify duplicate documents. I
have many redundant docs in my corpus, which inflates values and forces users
to process and reprocess much of the same material. Can the redundancy be
removed or managed in some way, either by Lucene at ingestion or by Mahout in
post-processing? The Vector Space Model seems to be notionally similar to PCA
or Factor Analysis, which both have similar ambitions. Thoughts?
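
To make the ingestion-time idea concrete, here is a minimal sketch of two common checks: a content hash for exact duplicates and Jaccard similarity over word shingles for near duplicates. It is plain Java with no Lucene or Mahout calls, and the class name, method names, and shingle size are hypothetical choices for illustration only, not something either library prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; not part of Lucene or Mahout.
public class DedupSketch {

    // Lowercase and collapse whitespace so trivial formatting differences are ignored.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
    }

    // Hex-encoded SHA-256 of the normalized text. Two docs with the same
    // fingerprint are treated as exact duplicates.
    static String fingerprint(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(normalize(text).getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Set of overlapping k-word sequences ("shingles") from the normalized text.
    static Set<String> shingles(String text, int k) {
        String[] words = normalize(text).split(" ");
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return out;
    }

    // Jaccard similarity: |A intersect B| / |A union B|, defined as 1.0 for two empty sets.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return (double) inter.size() / union;
    }

    public static void main(String[] args) {
        String d1 = "The quick brown fox";
        String d2 = "the   quick brown fox";
        String d3 = "the quick brown dog";
        System.out.println("exact duplicate: " + fingerprint(d1).equals(fingerprint(d2)));
        System.out.println("near-duplicate score: " + jaccard(shingles(d1, 2), shingles(d3, 2)));
    }
}
```

At ingestion, the fingerprint could be stored alongside each document and checked before adding a new one; the Jaccard score would need a threshold tuned to the corpus, and comparing every pair of documents does not scale without something like MinHash.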

Thank you in advance....

Rich Heimann

Richard Heimann
