hadoop-common-user mailing list archives

From "Chuck Lan" <c...@modeln.com>
Subject RE: Calculations involving large datasets
Date Tue, 26 Feb 2008 01:49:01 GMT
Thanks for the explanation.  Now I just gotta find some time to do a
prototype.


-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Friday, February 22, 2008 3:58 PM
To: core-user@hadoop.apache.org
Subject: Re: Calculations involving large datasets

Joins are easy.

Just reduce on a key composed of the stuff you want to join on.  If the
data you are joining is disparate, leave some kind of hint about what
kind of record you have.
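
Concretely, a minimal sketch of the mapper side (assuming the newer
org.apache.hadoop.mapreduce API and two hypothetical tab-delimited
inputs, a-*.tsv and b-*.tsv for tables A and B, whose first field is
the join key):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Tags each record with its source table so the reducer can tell the
// sides apart: emits (joinKey, "A\t<rest>") or (joinKey, "B\t<rest>").
public class TaggingJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String tag;

    @Override
    protected void setup(Context context) {
        // Hypothetical convention: the input file name says which table
        // this split belongs to.
        String file = ((FileSplit) context.getInputSplit()).getPath().getName();
        tag = file.startsWith("a") ? "A" : "B";
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t", 2);
        if (parts.length < 2) return;   // skip malformed lines
        context.write(new Text(parts[0]), new Text(tag + "\t" + parts[1]));
    }
}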

The reducer will be iterating through sets of records that have the same
key.  This is similar to the results of an outer join, except that if
you are joining A and B and there are multiple records with the join key
in either A or B, you will see them in the same reduce.  In many such
cases, MR is actually more efficient than a traditional join because you
don't necessarily want to generate the cross product of records.

In the reduce, you should build your composite record or do your
computation on a virtual composite join.
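
Continuing the same hypothetical job, a reducer sketch that rebuilds the
two tagged sides and emits composite records (the tab-delimited layout
is an assumption carried over from the mapper above):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// One reduce() call sees every tagged record that shares a join key.
// This version materializes the per-key cross product of A and B;
// per the point above, a real job might instead compute its result
// inside the loop and never build the pairs at all.
public class CompositeJoinReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> aSide = new ArrayList<String>();
        List<String> bSide = new ArrayList<String>();
        for (Text value : values) {
            // Hadoop reuses the Text instance, so copy via toString().
            String[] parts = value.toString().split("\t", 2);
            if ("A".equals(parts[0])) {
                aSide.add(parts[1]);
            } else {
                bSide.add(parts[1]);
            }
        }
        for (String a : aSide) {
            for (String b : bSide) {
                context.write(key, new Text(a + "\t" + b));
            }
        }
    }
}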

If you are doing a many-to-one join, then you often want the one to come
before the many, to avoid having to buffer the many until you see the
one.  This can be done by sorting on your group key plus a source key,
but grouping on just the group key.
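
A sketch of that sort/group split, again with assumed conventions: Text
map output keys of the hypothetical form "groupKey\ttag", where the
"one" side is tagged "0" and the "many" side "1", so that plain
lexicographic sorting delivers the one-side record first:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class OneToManyJoin {

    // Partition on the group key alone so every tag for a given key
    // lands on the same reducer.
    public static class GroupKeyPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String groupKey = key.toString().split("\t", 2)[0];
            return (groupKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group on the group key alone, so a single reduce() call sees both
    // sides while the full composite key still controls the sort order.
    public static class GroupKeyComparator extends WritableComparator {
        public GroupKeyComparator() {
            super(Text.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String groupA = a.toString().split("\t", 2)[0];
            String groupB = b.toString().split("\t", 2)[0];
            return groupA.compareTo(groupB);
        }
    }
}

Wiring it up is then job.setPartitionerClass(...) plus
job.setGroupingComparatorClass(...); the default sort comparator already
orders the full composite key.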

You should definitely look at Pig as well, since it might fit what I
presume to be a fairly SQL-centric culture better than writing large
programs.  Last time I looked (a few months ago), it was definitely not
ready for us and we have gone other directions.  The pace of change has
been prodigious, however, so I expect it is much better than when I last
looked.

On 2/22/08 10:12 AM, "Tim Wintle" <tim.wintle@teamrubber.com> wrote:

> Have you seen PIG:
> http://incubator.apache.org/pig/
> It generates Hadoop code and is more query-like, and (as far as I
> remember) includes union, join, etc.
> Tim
> On Fri, 2008-02-22 at 09:13 -0800, Chuck Lan wrote:
>> Hi,
>> I'm currently looking into how to better scale the performance of our
>> calculations involving large sets of financial data.  It is currently
>> a series of Oracle SQL statements that perform the calculations.  It
>> seems to me that the MapReduce algorithm may work in this scenario.
>> However, I believe we would need to perform some denormalization of
>> the data in order for this to work.  Do I have to?  Or is there a good
>> way to implement joins efficiently within the Hadoop framework?
>> Thanks,
>> Chuck
