mahout-user mailing list archives

From John Conwell <j...@iamjohn.me>
Subject Re: Item (text) deduplication
Date Mon, 15 Apr 2013 17:52:31 GMT
Depends on what kind of deduplication you're trying to do.  Do you want exact
dup detection?  Or near dup detection?

For exact dup you don't need Mahout.  Just run each doc through a mapper
that computes an MD5 hash of the doc and emits the MD5 hash value as the
mapper key and the doc id as the mapper value.  Then the reducer will pull
together all the documents that have the same MD5 hash value.
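
A minimal sketch of the exact-dup idea above, in plain Java with no Hadoop
dependency so it can run standalone.  The class and method names
(ExactDupSketch, md5Hex, duplicateGroups) are made up for illustration; in a
real job md5Hex would sit in the mapper and the grouping would be done by
the shuffle and the reducer.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical names; an in-memory stand-in for the mapper/reducer pair.
class ExactDupSketch {

    // "Map" step: the hex MD5 digest of the doc text plays the role of the mapper key.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // "Reduce" step: collect doc ids per hash and keep only hashes seen more than once.
    static List<List<String>> duplicateGroups(Map<String, String> docsById) {
        Map<String, List<String>> idsByHash = new TreeMap<>();
        for (Map.Entry<String, String> e : docsById.entrySet()) {
            idsByHash.computeIfAbsent(md5Hex(e.getValue()), k -> new ArrayList<>())
                     .add(e.getKey());
        }
        List<List<String>> groups = new ArrayList<>();
        for (List<String> ids : idsByHash.values()) {
            if (ids.size() > 1) {
                groups.add(ids);
            }
        }
        return groups;
    }
}
```

Keep in mind this only groups docs whose text is identical byte for byte.
Any difference in whitespace or case gives a different hash, so normalize
the text before hashing if you want those treated as the same doc.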

If you want to do near dup analysis, you can go with an ngram shingling
analysis.  I don't think there is anything built into Mahout that does this,
but you can use Mahout's ngram generation and specify a very low
log-likelihood score so most/all of the ngrams get emitted.  Then use this
ngram data in your shingling algorithm.  There are several known shingling
algorithms out there; just google them and implement one.
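
To give a rough idea of what a shingling comparison looks like, here is a
small standalone Java sketch.  It is just one simple variant (word ngram
shingles compared with Jaccard similarity), not anything from Mahout, and
it builds the ngrams itself instead of reading Mahout's ngram output.  The
names (ShingleSketch, shingles, jaccard) and the choice of n are
assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical names; one simple shingling variant, not a Mahout API.
class ShingleSketch {

    // Break the text into lower-cased word tokens and return the set of word n-grams.
    static Set<String> shingles(String text, int n) {
        Set<String> out = new HashSet<>();
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return out;
    }

    // Jaccard similarity: |A intersect B| / |A union B|, defined as 1.0 for two empty sets.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```

For example, with n = 2, "the quick brown fox" and "the quick brown dog"
share 2 of 4 distinct shingles, so jaccard returns 0.5.  You would pick a
threshold above which two items count as near dups, and that threshold
needs tuning on your own data.  Comparing every pair this way gets
expensive on large datasets, which is the problem techniques like MinHash
are meant to address.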





On Mon, Apr 15, 2013 at 6:38 AM, xdcfff <xdcfff@gmail.com> wrote:

> Hi all,
>
> Just looking for some general guidance on how I would approach this task.
>
> If I have two datasets containing items, what is currently the best way to
> detect duplicates between them using Mahout? I intend on matching based on
> item name text similarity to begin with.
>
> I'm willing to write Java wherever necessary, but I just want to be sure to
> avoid "re-coding the wheel" as such.
>
> Cheers,
> -dcf
>



-- 

Thanks,
John C
