mahout-user mailing list archives

From Jason Rutherglen <jason.rutherg...@gmail.com>
Subject Finding the similarity of documents using Mahout for deduplication
Date Fri, 17 Jul 2009 19:26:40 GMT
I think this comes up fairly often in search apps: duplicate
documents get indexed (for example, in SimplyHired's search
the same job is listed 20 times from different websites). A
similarity score above a threshold would indicate that two
documents are duplicates, so the extra copies can be removed.
Is there a recommended Mahout algorithm for this?
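
To make the threshold idea concrete, here is a minimal sketch in plain Python. It is not Mahout code and does not use any Mahout API. It assumes Jaccard similarity over word 3-shingles as the similarity score, and the function names, the shingle size, and the 0.7 threshold are all hypothetical choices for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) for a document."""
    words = text.lower().split()
    if len(words) < k:
        # Documents shorter than k words become a single shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.7, k=3):
    """Keep a document only if its similarity to every document
    kept so far is at or below the threshold (assumed value)."""
    kept = []  # list of (document, shingle set) pairs
    for doc in docs:
        s = shingles(doc, k)
        if all(jaccard(s, kept_s) <= threshold for _, kept_s in kept):
            kept.append((doc, s))
    return [doc for doc, _ in kept]


if __name__ == "__main__":
    listings = [
        "Senior Java developer wanted in San Francisco apply now",
        "Senior Java developer wanted in San Francisco apply today",
        "Registered nurse needed for night shift in Boston",
    ]
    for doc in deduplicate(listings):
        print(doc)
```

This compares every document against every kept document, so it is quadratic in the number of documents and only illustrates the thresholding step, not something that would scale to a full search index.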
