# hadoop-common-user mailing list archives

##### Site index · List index
Message view
Top
From Todd Lipcon <t...@cloudera.com>
Subject Re: joining two large files in hadoop
Date Sun, 05 Apr 2009 04:47:12 GMT
```On Sat, Apr 4, 2009 at 2:11 PM, Christian Ulrik Søttrup <soettrup@nbi.dk>wrote:

> Hello all,
>
> I need to do some calculations that has to merge two sets of very large
> data  (basically calculate variance).
> One set contains a set of "means" and the second  a set of objects tied to
> a mean.
>
> Normally I would  send the set of means using the distributed cache, but
> the set has become too large to keep in memory and it is going to grow in
> the future.
>

Hi Christian,

Others have done a good job answering your question about doing this as a
join, but here's one idea that might allow you to skip the join altogether:

If you're simply calculating variance of data sets, you can use a bit of a
math trick to do it in one pass without precomputing the means:

E = the expectation operator
mu = mean = E[x]
Variance = E[ (x - mu)^2 ]

Expand the square:
= E[x^2 - 2*x*mu + mu^2]

by linearity of expectation:
= E[x^2] - 2*mu*E[x] + E[mu^2]

mu in this equation is constant, so E[mu^2] = mu^2.
Also recall that E[x] = mu
= E[x^2] - 2*E[x]^2 + E[x]^2
= E[x^2] - E[x]^2

Apologies for the ugly math notation, but hopefully it's clear. The takeaway
is that you can separately calculate sum(x^2) and sum(x) in your job, and
calculate variance directly from the results. Here's the general outline for
the MR job:

Map:
collect (1, x, x^2)
Combine:
sum up tuples
Reduce:
input from combine: (N, sum(x), sum(x^2))
output: Variance = 1/N(sum(x^2)) - (1/N sum(x))^2

Hope that's helpful for you!

-Todd

```
Mime
• Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message