hadoop-common-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: joining two large files in hadoop
Date Sat, 04 Apr 2009 21:36:36 GMT
>I need to do some calculations that have to merge two very large 
>data sets (basically, calculating variance).
>One set contains the "means" and the second a set of objects, 
>each tied to a mean.
>Normally I would send the set of means using the distributed cache, 
>but the set has become too large to keep in memory, and it is going 
>to grow in the future.

You might want to check out Cascading (http://www.cascading.org), 
which is an API for doing data processing on Hadoop - it has support 
for SQL-style joins (sounds like what you want) via its CoGroup pipe.
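
To make the join concrete, here is a minimal plain-Java sketch of what 
a CoGroup-style inner join followed by a variance calculation computes. 
It is not Cascading or Hadoop code: the class CoGroupSketch, the 
Observation record, and the varianceByMean method are hypothetical names 
made up for illustration, and everything is held in memory. On a 
cluster, the grouping by key would be done by the framework's 
sort/shuffle, so neither data set has to fit in RAM. Population 
variance (divide by n) is assumed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Cascading API.
class CoGroupSketch {

    /** Hypothetical record: one object tied to a mean via a shared key. */
    static final class Observation {
        final String meanId;
        final double value;

        Observation(String meanId, double value) {
            this.meanId = meanId;
            this.value = value;
        }
    }

    /**
     * Joins observations to means on the key and returns the population
     * variance per key. Keys with no matching mean are dropped (inner join).
     */
    static Map<String, Double> varianceByMean(Map<String, Double> means,
                                              List<Observation> observations) {
        // Step 1: group observations by join key (the framework's shuffle
        // would do this across the cluster).
        Map<String, List<Double>> grouped = new HashMap<>();
        for (Observation o : observations) {
            grouped.computeIfAbsent(o.meanId, k -> new ArrayList<>()).add(o.value);
        }

        // Step 2: for each key present on both sides, accumulate the
        // squared deviations from that key's mean.
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet()) {
            Double mean = means.get(e.getKey());
            if (mean == null) {
                continue; // no matching mean: inner-join semantics
            }
            double sumSq = 0.0;
            for (double v : e.getValue()) {
                double d = v - mean;
                sumSq += d * d;
            }
            result.put(e.getKey(), sumSq / e.getValue().size());
        }
        return result;
    }
}
```

For example, with a mean of 2.0 under key "a" and observations 1.0 and 
3.0 tied to "a", varianceByMean returns 1.0 for "a". An observation 
whose key has no entry in the means set is left out of the result.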

-- Ken
Ken Krugler
+1 530-210-6378
