hadoop-common-user mailing list archives

From "Chuck Lan" <c...@modeln.com>
Subject RE: Calculations involve large datasets
Date Tue, 26 Feb 2008 01:49:01 GMT
Thanks for the explanation.  Now I just gotta find some time to do a
POC!

-Chuck

-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Friday, February 22, 2008 3:58 PM
To: core-user@hadoop.apache.org
Subject: Re: Calculations involving large datasets



Joins are easy.

Just reduce on a key composed of the stuff you want to join on.  If the
data you are joining is disparate, leave some kind of hint about what
kind of record you have.
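
For illustration, here is a minimal sketch of that tagging step using the
org.apache.hadoop.mapreduce API; the file names, comma-separated record
layout, and the "A"/"B" tags are hypothetical, not from this thread:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TaggingJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split(",");
    // Assume the join key is the first comma-separated field in both inputs.
    String joinKey = fields[0];
    // Use the input file name as the "hint" about which side the record came from.
    String file = ((FileSplit) context.getInputSplit()).getPath().getName();
    String tag = file.startsWith("accounts") ? "A" : "B";  // hypothetical file names
    context.write(new Text(joinKey), new Text(tag + "\t" + line.toString()));
  }
}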

The reducer will be iterating through sets of records that have the same
key.  This is similar to the results of an outer join, except that if you
are joining A and B and there are multiple records with the join key in
either A or B, you will see them in the same reduce.  In many such cases,
MR is actually more efficient than a traditional join because you don't
necessarily want to generate the cross product of records.

In the reduce, you should either build your composite record or do your
computation directly on the co-grouped records as a virtual join.
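
A matching reducer sketch, under the same hypothetical layout as the mapper
above: all records sharing a join key arrive in one reduce call, so the
reducer can pick out the single "A" record and aggregate the "B" side
directly, without materializing a cross product.  The field positions are
assumptions:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<Text, Text, Text, NullWritable> {
  @Override
  protected void reduce(Text joinKey, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String aRecord = null;   // the "one" side, assumed unique per key
    double total = 0.0;      // example aggregate over the "many" side
    for (Text v : values) {
      String[] parts = v.toString().split("\t", 2);
      if ("A".equals(parts[0])) {
        aRecord = parts[1];
      } else {
        // Hypothetical: the amount is the third comma-separated field of a B record.
        total += Double.parseDouble(parts[1].split(",")[2]);
      }
    }
    if (aRecord != null) {
      context.write(new Text(aRecord + "," + total), NullWritable.get());
    }
  }
}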

If you are doing a many-to-one join, then you often want the one to appear
before the many to avoid having to buffer the many until you see the one.
This can be done by sorting on your group key plus a source key, but
grouping on just the group key.
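
A rough sketch of that sort/group split with the standard mapreduce API,
using a plain Text key of the form joinKey + "\t" + tag (tag "0" for the
one side so it sorts first).  The class names are hypothetical, and a
production job would normally use a custom WritableComparable rather than
splitting strings:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class OneBeforeMany {
  // Partition on the join key only, so both sides of a key reach the same reducer.
  public static class JoinKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
      String joinKey = key.toString().split("\t", 2)[0];
      return (joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group on the join key only, so one reduce call sees the "one" record first
  // and then all of the "many" records, with nothing buffered in memory.
  public static class JoinKeyGroupingComparator extends WritableComparator {
    protected JoinKeyGroupingComparator() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      String ka = a.toString().split("\t", 2)[0];
      String kb = b.toString().split("\t", 2)[0];
      return ka.compareTo(kb);
    }
  }

  // Hypothetical job wiring; the full sort (key plus tag) stays the default Text order.
  public static void configure(Job job) {
    job.setPartitionerClass(JoinKeyPartitioner.class);
    job.setGroupingComparatorClass(JoinKeyGroupingComparator.class);
  }
}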


You should definitely look at Pig as well, since it might fit what I would
presume to be a fairly SQL-centric culture better than writing large Java
programs.  Last time I looked (a few months ago), it was definitely not
ready for us and we have gone other directions.  The pace of change has
been prodigious, however, so I expect it is much better than when I last
looked hard.


On 2/22/08 10:12 AM, "Tim Wintle" <tim.wintle@teamrubber.com> wrote:

> Have you seen PIG:
> http://incubator.apache.org/pig/
> 
> It generates Hadoop code and is more query-like, and (as far as I
> remember) includes union, join, etc.
> 
> Tim
> 
> On Fri, 2008-02-22 at 09:13 -0800, Chuck Lan wrote:
>> Hi,
>> 
>> I'm currently looking into how to better scale the performance of our
>> calculations involving large sets of financial data.  It is currently
>> using a series of Oracle SQL statements to perform the calculations.
>> It seems to me that the MapReduce algorithm may work in this scenario.
>> However, I believe I would need to perform some denormalization of data
>> in order for this to work.  Do I have to?  Or is there a good way to
>> implement joins within the Hadoop framework efficiently?
>> 
>> Thanks,
>> Chuck
> 

