hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ankur C. Goel" <gan...@yahoo-inc.com>
Subject Re: DeDuplication Techniques
Date Fri, 26 Mar 2010 11:31:32 GMT
Yes that is the next logical step in performance optimization :-)
When you have no historical data to begin with, it makes little difference.
However, as the volume of historical data grows with time the gains become
more evident.

As for sort performance, only the new data will need sorting as historical data
is already kept sorted so the performance shouldn't be a problem.

So the steps become like this.

First M/R job -> Sort new data.
Second Map-only  job -> Merge sorted historical data with sorted new data.


On 3/26/10 3:35 PM, "Andrew Klochkov" <diggerk@gmail.com> wrote:

Would it be a good optimization to have historical data (stored in HDFS)
sorted by the primary key, and also sort new data before joining? I guess in
this case join can be performed more effective (in a InputFormat
implementation), avoiding sort/shuffle/copy-to-reducer part.
The only double I have is how fast can data be sorted, wouldn't it kill all
the optimization.

On Fri, Mar 26, 2010 at 7:20 AM, Ankur C. Goel <gankur@yahoo-inc.com> wrote:

> The kind of need to you specified is quite common in ETL style of
> processing. The fastest and most efficient way to do this is when you have
> all your historical data in HDFS itself. In this case you can do a LEFT
> outer join between the two datasets (assuming new data is your left
> relation) in map-reduce without querying a database or any other persistent
> store. Then you would keep only the records which have fields from the left
> relation and NOT the right relation (historical data).
> A join can be easily implemented in map-reduce using the secondary sort
> trick. Basically you can specify different mappers for different input data
> in the same M/R job and in each mapper tag the record key with relation ids
> (0, 1...). This makes sure that records from one relation for matching key
> appear before the record of other relation in reducer. You then cache them
> in memory and do a cross of this with each record of the new relation you
> see.
> This might sound more complicated then it really is. Hadoop has sample code
> under examples for secondary sort but no code for join.
> Another option is to use a high level languages like Pig or HIVE that
> provide join operations and also expose extensions to take care of skew in
> data i.e data getting divided unevenly die to few keys having majority of
> records. This is the simplest and quickest (in terms of developer
> productivity) IMO.
> Regards
> -@nkur
> On 3/26/10 12:05 AM, "Joseph Stein" <cryptcom@gmail.com> wrote:
> The thing is I have to check historic data (meaning data I have
> already aggregated against) so I basically need to hold and read from
> a file of hashes.
> So within the current data set yes this would work but I then have to
> open a file, loop through the value, see it is not there.
> If it is there then throw it out, if not there add it to the end.
> To me this opening a file checking for dups is a map/reduce task in itself.
> What I was thinking is having my mapper take the data I wasn to
> validate as unique.  I then loop through the files filters.  each data
> point has a key that then allows me to get the file that has it's
> data. e.g. a part of the data partions the hash of the data so each
> file holds.  So my map job takes the data and breaks it into the
> key/value pair (the key allows me to look up my filter file).
> When it gets to the reducer... the key is the file I open up, I then
> open the file... loop through it... if it is there throw the data
> away.  if it is not there then add the hash of my data to the filter
> file and then output (as the reduce output) the value of the unique.
> This output of the unique is then the data I aggregate on which also
> updated my historic filter so the next job (5 minutes later) see it,
> etc.
> On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner <markkerzner@gmail.com>
> wrote:
> > Joe,
> >
> > what about this approach:
> >
> > using hashmap values as your keys in MR maps. Since they are sorted by
> keys,
> > in reducer you will get all duplicates together, so that you can loop
> > through them. As the simplest solution, you just take the first one.
> >
> > Sincerely,
> > Mark
> >
> > On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein <cryptcom@gmail.com>
> wrote:
> >
> >> I have been researching ways to handle de-dupping data while running a
> >> map/reduce program (so as to not re-calculate/re-aggregate data that
> >> we have seen before[possibly months before]).
> >>
> >> The data sets we have are littered with repeats of data from mobile
> >> devices which continue to come in over time (so we may see duplicates
> >> of data re-posted months after it originally posted...)
> >>
> >> I have 2 ways so far I can go about it (one way I do in production
> >> without Hadoop) and interested to see if others have faced/solved this
> >> in Hadoop/HDFS and what their experience might be.
> >>
> >> 1) handle my own hash filter (where I continually store and look up a
> >> hash (MD5, bloom, whatever) of the data I am aggregating on as
> >> existing already).  We do this now without Hadoop perhaps a variant
> >> can be ported into HDFS as map task, reducing the results to files and
> >> restoring the hash table (maybe in Hive or something, dunno yet)
> >> 2) push the data into Cassandra (our NoSQL solution of choice) and let
> >> that hash/map system do it for us.   As I get more into Hadoop looking
> >> at HBas is tempting but then just one more thing to learn.
> >>
> >> I would really like to not have to reinvent a wheel here and even
> >> contribute if something is going on as it is a use case in our work
> >> effort.
> >>
> >> Thanx in advance =8^)  Apologize I posted this on common dev yesterday
> >> by accident (so this is not a repost spam but appropriate for this
> >> list)
> >>
> >> Cheers.
> >>
> >> /*
> >> Joe Stein
> >> http://www.linkedin.com/in/charmalloc
> >> */
> >>
> >
> --
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */

Andrew Klochkov

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message