Subject: Re: joining two large files in hadoop
From: Todd Lipcon
To: core-user@hadoop.apache.org
Date: Sat, 4 Apr 2009 21:47:12 -0700

On Sat, Apr 4, 2009 at 2:11 PM, Christian Ulrik Søttrup wrote:

> Hello all,
>
> I need to do some calculations that have to merge two sets of very
> large data (basically calculate variance).
> One set contains a set of "means" and the second a set of objects
> tied to a mean.
>
> Normally I would send the set of means using the distributed cache,
> but the set has become too large to keep in memory and it is going
> to grow in the future.

Hi Christian,

Others have done a good job answering your question about doing this
as a join, but here's one idea that might allow you to skip the join
altogether:

If you're simply calculating the variance of your data sets, you can
use a bit of a math trick to do it in one pass, without precomputing
the means:

E  = the expectation operator
mu = mean = E[x]

Variance = E[ (x - mu)^2 ]

Expand the square:

         = E[ x^2 - 2*x*mu + mu^2 ]

By linearity of expectation:

         = E[x^2] - 2*mu*E[x] + E[mu^2]

mu is a constant in this equation, so E[mu^2] = mu^2. Also recall that
E[x] = mu, so both trailing terms become E[x]^2:

         = E[x^2] - 2*E[x]^2 + E[x]^2
         = E[x^2] - E[x]^2

Apologies for the ugly math notation, but hopefully it's clear. The
takeaway is that you can calculate sum(x) and sum(x^2) separately in
your job and then compute the variance directly from those sums.
Here's the general outline for the MR job:

Map:     for each value x, emit the tuple (1, x, x^2)
Combine: sum the tuples component-wise
Reduce:  input from combine: (N, sum(x), sum(x^2))
         output: Variance = (1/N)*sum(x^2) - ((1/N)*sum(x))^2

(A rough Java sketch of this pipeline is in the P.S. below.)

Hope that's helpful for you!

-Todd
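P.S. A quick numeric check of the identity: for x = {1, 2, 3},
E[x^2] = (1 + 4 + 9)/3 = 14/3 and E[x]^2 = 2^2 = 4, so
Variance = 14/3 - 4 = 2/3, which matches E[(x - mu)^2] =
(1 + 0 + 1)/3 = 2/3.

And here is a rough, untested Java sketch of the job above using
Hadoop's org.apache.hadoop.mapreduce classes. The specifics are just
assumptions for illustration: it pretends each input line looks like
"groupId<TAB>value", keys the statistics by group, and packs the
(count, sum, sum of squares) tuple into a comma-separated Text value.
Adapt the key type, parsing, and output format to your real data.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class VarianceSketch {

  // Map: for each value x, emit (groupId, "1,x,x^2").
  public static class VarMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      double x = Double.parseDouble(parts[1]);
      ctx.write(new Text(parts[0]), new Text("1," + x + "," + (x * x)));
    }
  }

  // Combine: sum the (count, sum(x), sum(x^2)) tuples component-wise,
  // so each map task ships only one small tuple per group.
  public static class VarCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double n = 0, sum = 0, sumSq = 0;
      for (Text v : values) {
        String[] t = v.toString().split(",");
        n += Double.parseDouble(t[0]);
        sum += Double.parseDouble(t[1]);
        sumSq += Double.parseDouble(t[2]);
      }
      ctx.write(key, new Text(n + "," + sum + "," + sumSq));
    }
  }

  // Reduce: same summation, then Variance = E[x^2] - E[x]^2.
  public static class VarReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double n = 0, sum = 0, sumSq = 0;
      for (Text v : values) {
        String[] t = v.toString().split(",");
        n += Double.parseDouble(t[0]);
        sum += Double.parseDouble(t[1]);
        sumSq += Double.parseDouble(t[2]);
      }
      double mean = sum / n;
      ctx.write(key, new Text(Double.toString(sumSq / n - mean * mean)));
    }
  }
}

Because the combiner collapses each mapper's output to a single small
tuple per group before the shuffle, the data sent to the reducers
stays tiny no matter how big the input gets. The combiner may run
zero, one, or several times, which is fine here since combine and
reduce perform the same component-wise summation.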