hadoop-common-user mailing list archives

From Christian Ulrik Søttrup <soett...@nbi.dk>
Subject joining two large files in hadoop
Date Sat, 04 Apr 2009 21:11:32 GMT
Hello all,

I need to do some calculations that require merging two very large data 
sets (basically, calculating a variance).
One set contains the "means" and the second contains objects, each tied 
to a mean.

Normally I would send the set of means using the distributed cache, but 
the set has become too large to keep in memory and it is going to grow 
in the future.

I would like to join the two data files so that each mapper gets the 
entries of both files with the same keys. I have seen there is a 
CompositeInputFormat but there is no real documentation on it.
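
For illustration, the join described above can be sketched as a merge over 
two inputs that are already sorted by key, so that neither set has to be 
held in memory. This is only a sketch under assumptions the message does not 
state: records are (key, value) pairs, each key has at most one mean, and 
the variance is the population variance around the supplied mean. The 
function name merge_join_variance is hypothetical, and this is plain Python, 
not the CompositeInputFormat API.

```python
from itertools import groupby
from operator import itemgetter


def merge_join_variance(means, objects):
    """Stream-join two key-sorted inputs and yield (key, variance).

    means:   iterable of (key, mean), sorted by key, one entry per key (assumed)
    objects: iterable of (key, value), sorted by key
    Keys that appear in only one input are skipped.
    """
    means_iter = iter(means)
    current = next(means_iter, None)

    # groupby only needs one group of objects at a time, so memory use
    # does not grow with the size of either input.
    for key, group in groupby(objects, key=itemgetter(0)):
        # Advance the means stream until it reaches or passes this key.
        while current is not None and current[0] < key:
            current = next(means_iter, None)
        if current is None:
            break  # no means left, so nothing further can match
        if current[0] != key:
            continue  # this key has no mean

        mean = current[1]
        count = 0
        squared_diffs = 0.0
        for _, value in group:
            count += 1
            squared_diffs += (value - mean) ** 2
        # Population variance (divide by count); an assumption of this sketch.
        yield key, squared_diffs / count
```

The sketch only works because both inputs arrive in the same key order; 
with unsorted inputs one side would have to be buffered or looked up.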

Can anyone enlighten me on whether it would be useful and how it works?

