hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Runping Qi" <runp...@yahoo-inc.com>
Subject RE: Calculations involve large datasets
Date Sat, 23 Feb 2008 00:12:08 GMT

There is a package for joining data from multiple sources: 

contrib/data-join.

It implements the basic joining logic and allows the user to provide
application specific logic for filtering/projecting and combining
multiple records into one.

Runping


> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com]
> Sent: Friday, February 22, 2008 3:58 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Calculations involve large datasets
> 
> 
> 
> Joins are easy.
> 
> Just reduce on a key composed of the stuff you want to join on.  If
the
> data
> you are joining is disparate, leave some kind of hint about what kind
of
> record you have.
> 
> The reducer will be iterating through sets of records that have the
same
> key.  This is similar to the results of an outer join, except that if
you
> are joining A and B and there are multiple records with the join key
in
> either A or B, you will see them in the same reduce.  In many such
cases,
> MR
> is actually more efficient than a traditional join because you don't
> necessarily want to generate the cross product of records.
> 
> In the reduce, you should build your composite record or do your
composite
> on a virtual composite join.
> 
> If you are doing a many-to-one join, then you often want the one to
appear
> before the many to avoid having to buffer the many until you see the
one.
> This can be done by sorting on your group key plus a source key, but
> grouping on just the group key.
> 
> 
> You should definitely look at Pig as well since it might fit what I
would
> presume to be a fairly SQL centric culture better than writing large
Java
> programs.  Last time I looked (a few months ago), it was definitely
not
> ready for us and we have gone other directions.  The pace of change
has
> been
> prodigous, however, so I expect it is much better than when I last
looked
> hard.
> 
> 
> On 2/22/08 10:12 AM, "Tim Wintle" <tim.wintle@teamrubber.com> wrote:
> 
> > Have you seen PIG:
> > http://incubator.apache.org/pig/
> >
> > It generates hadoop code and is more query like, and (as far as I
> > remember) includes union, join, etc.
> >
> > Tim
> >
> > On Fri, 2008-02-22 at 09:13 -0800, Chuck Lan wrote:
> >> Hi,
> >>
> >> I'm currently looking into how to better scale the performance of
our
> >> calculations involving large sets of financial data.  It is
currently
> using
> >> a series of Oracle SQL statements to perform the calculations.  It
> seems to
> >> me that the MapReduce algorithm may work in this scenario.
However, I
> >> believe would need to perform some denormalization of data in order
for
> this
> >> to work.  Do I have to?  Or is there a good way to implement joins
> within
> >> the Hadoop framework efficiently?
> >>
> >> Thanks,
> >> Chuck
> >


Mime
View raw message