hadoop-common-dev mailing list archives

From Joseph Stein <crypt...@gmail.com>
Subject De-Duplication Technique
Date Wed, 24 Mar 2010 20:34:16 GMT
I have been researching ways to handle de-duping data while running a
map/reduce program (so as not to re-calculate/re-aggregate data that
we have seen before, possibly months before).

The data sets we have are littered with repeats of data from mobile
devices, which keeps coming in over time (so we may see duplicates of
data re-posted months after it was originally posted...).

So far I have two ways I can go about it (one of which I already do in
production without Hadoop), and I am interested to see whether others
have faced/solved this in Hadoop/HDFS and what their experience might be.

1) Handle my own hash filter, where I continually store and look up a
hash (MD5, bloom filter, whatever) of the data I am aggregating on to
check whether it already exists. We do this now without Hadoop; perhaps
a variant can be ported into HDFS as a map task, reducing the results
to files and restoring the hash table (maybe in Hive or something,
dunno yet). A rough sketch of the map/reduce pass is below.
2) Push the data into Cassandra (our NoSQL solution of choice) and let
that hash/map system do it for us. As I get deeper into Hadoop, looking
at HBase is tempting, but that is just one more thing to learn. The
second sketch below shows the keying idea I have in mind.
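
To make (1) a bit more concrete, here is a rough sketch of what the
map/reduce pass could look like (the class names are made up, I pull in
commons-codec for the MD5, and hashing the whole line is just a
placeholder for hashing the fields we actually aggregate on):

import java.io.IOException;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DedupJob {

    // Map: key each record by an MD5 digest of the data we aggregate on.
    public static class DedupMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // Hashing the whole line for simplicity; really we would hash
            // only the fields that define "the same" record.
            String digest = DigestUtils.md5Hex(record.toString());
            context.write(new Text(digest), record);
        }
    }

    // Reduce: every copy of a record shares its digest, so emit just one.
    public static class DedupReducer
            extends Reducer<Text, Text, NullWritable, Text> {
        @Override
        protected void reduce(Text digest, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), records.iterator().next());
        }
    }
}

That only de-dups within one run, of course; the part I still have to
solve is persisting the hash table between runs so that a record
re-posted months later is caught too.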
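
And for (2), what I would really be leaning on is just that writes keyed
by a content digest are idempotent, so a re-post overwrites itself. A
plain HashMap stands in for the store below (the actual Cassandra or
HBase client calls would look different; this only shows the keying idea):

import java.util.HashMap;
import java.util.Map;
import org.apache.commons.codec.digest.DigestUtils;

public class IdempotentWriteSketch {
    public static void main(String[] args) {
        // Stand-in for the key/value store, keyed by a digest of the record.
        Map<String, String> store = new HashMap<String, String>();
        String[] incoming = {
            "device=42,metric=clicks,value=7",
            "device=42,metric=clicks,value=7",   // duplicate re-posted later
            "device=43,metric=clicks,value=3"
        };
        for (String record : incoming) {
            // Same record -> same key -> overwrite, not a second copy.
            store.put(DigestUtils.md5Hex(record), record);
        }
        System.out.println(store.size());  // prints 2
    }
}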

I would really like not to have to reinvent the wheel here, and I would
even contribute if something is already going on, as this is a use case
in our work effort.

Thanx in advance =8^)

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/
