Subject: Re: joining two large files in hadoop
From: Todd Lipcon
To: core-user@hadoop.apache.org
Date: Sat, 4 Apr 2009 21:47:12 -0700

On Sat, Apr 4, 2009 at 2:11 PM, Christian Ulrik Søttrup wrote:

> Hello all,
>
> I need to do some calculations that have to merge two sets of very
> large data (basically calculate variance).
> One set contains a set of "means" and the second a set of objects
> tied to a mean.
>
> Normally I would send the set of means using the distributed cache,
> but the set has become too large to keep in memory and it is going
> to grow in the future.

Hi Christian,

Others have done a good job answering your question about doing this
as a join, but here's one idea that might allow you to skip the join
altogether:

If you're simply calculating the variance of your data sets, you can
use a bit of a math trick to do it in one pass, without precomputing
the means:

E  = the expectation operator
mu = mean = E[x]

Variance = E[ (x - mu)^2 ]

Expand the square:

         = E[ x^2 - 2*x*mu + mu^2 ]

By linearity of expectation:

         = E[x^2] - 2*mu*E[x] + E[mu^2]

mu is a constant in this equation, so E[mu^2] = mu^2. Also recall that
E[x] = mu, so both trailing terms become E[x]^2:

         = E[x^2] - 2*E[x]^2 + E[x]^2
         = E[x^2] - E[x]^2

Apologies for the ugly math notation, but hopefully it's clear. The
takeaway is that you can calculate sum(x) and sum(x^2) separately in
your job and then compute the variance directly from those sums.
Here's the general outline for the MR job:

Map:     for each value x, emit the tuple (1, x, x^2)
Combine: sum the tuples component-wise
Reduce:  input from combine: (N, sum(x), sum(x^2))
         output: Variance = (1/N)*sum(x^2) - ((1/N)*sum(x))^2

(A rough Java sketch of this pipeline is in the P.S. below.)

Hope that's helpful for you!

-Todd
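P.S. A quick numeric check of the identity: for x = {1, 2, 3},
E[x^2] = (1 + 4 + 9)/3 = 14/3 and E[x]^2 = 2^2 = 4, so
Variance = 14/3 - 4 = 2/3, which matches E[(x - mu)^2] =
(1 + 0 + 1)/3 = 2/3.

And here is a rough, untested Java sketch of the job above using
Hadoop's org.apache.hadoop.mapreduce classes. The specifics are just
assumptions for illustration: it pretends each input line looks like
"groupId<TAB>value", keys the statistics by group, and packs the
(count, sum, sum of squares) tuple into a comma-separated Text value.
Adapt the key type, parsing, and output format to your real data.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class VarianceSketch {

  // Map: for each value x, emit (groupId, "1,x,x^2").
  public static class VarMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      double x = Double.parseDouble(parts[1]);
      ctx.write(new Text(parts[0]), new Text("1," + x + "," + (x * x)));
    }
  }

  // Combine: sum the (count, sum(x), sum(x^2)) tuples component-wise,
  // so each map task ships only one small tuple per group.
  public static class VarCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double n = 0, sum = 0, sumSq = 0;
      for (Text v : values) {
        String[] t = v.toString().split(",");
        n += Double.parseDouble(t[0]);
        sum += Double.parseDouble(t[1]);
        sumSq += Double.parseDouble(t[2]);
      }
      ctx.write(key, new Text(n + "," + sum + "," + sumSq));
    }
  }

  // Reduce: same summation, then Variance = E[x^2] - E[x]^2.
  public static class VarReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double n = 0, sum = 0, sumSq = 0;
      for (Text v : values) {
        String[] t = v.toString().split(",");
        n += Double.parseDouble(t[0]);
        sum += Double.parseDouble(t[1]);
        sumSq += Double.parseDouble(t[2]);
      }
      double mean = sum / n;
      ctx.write(key, new Text(Double.toString(sumSq / n - mean * mean)));
    }
  }
}

Because the combiner collapses each mapper's output to a single small
tuple per group before the shuffle, the data sent to the reducers
stays tiny no matter how big the input gets. The combiner may run
zero, one, or several times, which is fine here since combine and
reduce perform the same component-wise summation.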