Subject: Re: reading input for a map function from 2 different files?
From: Joel Welling
Reply-To: welling@psc.edu
To: core-user@hadoop.apache.org
Organization: Pittsburgh Supercomputing Center
Date: Wed, 12 Nov 2008 14:46:12 -0500

Amar, isn't there a problem with your method in that it gets a small
result by subtracting very large numbers? Given a million inputs, won't
A and B be so much larger than the standard deviation that there aren't
enough bits left in the floating point number to represent it?

I just thought I should mention that, before this thread goes in an
archive somewhere and some student looks it up.

-Joel

On Wed, 2008-11-12 at 12:32 +0530, Amar Kamat wrote:
> some speed wrote:
> > Thanks for the response. What I am trying to do is find the average
> > and then the standard deviation for a very large set (say a million) of
> > numbers. The result would be used in further calculations.
> > I have got the average from the first map-reduce chain. Now I need to read
> > this average as well as the set of numbers to calculate the standard
> > deviation. So one file would have the input set and the other "resultant"
> > file would have just the average.
> > Please do tell me in case there is a better way of doing things than what I
> > am doing. Any input/suggestion is appreciated. :)
> >
> std_dev^2 = sum_i((Xi - Xa) ^ 2) / N; where Xa is the avg.
> Why don't you use the formula to compute it in one MR job?
> std_dev^2 = (sum_i(Xi ^ 2) - N * (Xa ^ 2)) / N
>           = (A - N * (avg ^ 2)) / N
>
> For this your map would look like
> map (key, val) : output.collect(key^2, key); // imagine your input as (k,v) = (Xi, null)
> Reduce should simply sum over the keys to find 'sum_i(Xi ^ 2)' and
> sum over the values to find 'sum_i(Xi)', which gives 'Xa' once divided by N.
> You could use the close() api to finally dump these 2 values to a file.
>
> For example :
> input : 1,2,3,4
> Say input is split in 2 groups [1,2] and [3,4]
> Now there will be 2 maps with output as follows
> map1 output : (1,1) (4,2)
> map2 output : (9,3) (16,4)
>
> Reducer will maintain the sum over all keys and all values
> A = sum(key i.e input squared) = 1 + 4 + 9 + 16 = 30
> B = sum(values i.e input) = 1 + 2 + 3 + 4 = 10
>
> With A and B you can compute the standard deviation offline.
> So avg = B / N = 10/4 = 2.5
> Hence the std deviation would be
> sqrt((A - N * avg^2) / N) = sqrt((30 - 4*6.25)/4) = 1.11803399
>
> Using the main formula the answer is 1.11803399
> Amar
> >
> > On Mon, Nov 10, 2008 at 4:22 AM, Amar Kamat wrote:
> >
> >> Amar Kamat wrote:
> >>
> >>> some speed wrote:
> >>>
> >>>> I was wondering if it was possible to read the input for a map function
> >>>> from 2 different files:
> >>>> 1st file ---> user-input file from a particular location(path)
> >>>>
> >>> Is the input/user file sorted? If yes then you can use "map-side join" for
> >>> performance reasons. See org.apache.hadoop.mapred.join for more details.
> >>>
> >>>> 2nd file ---> A resultant file (has just one pair) from a
> >>>> previous MapReduce job. (I am implementing a chain MapReduce function)
> >>>>
> >>> Can you explain in more detail the contents of 2nd file?
> >>>
> >>>> Now, for every pair in the user-input file, I would like to use
> >>>> the same pair from the 2nd file for some calculations.
> >>>>
> >>> Can you explain this in more detail?
> >>> Can you give some abstracted example of what file1 and file2 look like
> >>> and what operation/processing you want to do?
> >>>
> >>> I guess you might need to do some kind of join on the 2 files. Look at
> >>> contrib/data_join for more details.
> >>> Amar
> >>>
> >>>> Is it possible for me to do so? Can someone guide me in the right
> >>>> direction please?
> >>>>
> >>>> Thanks!
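
For anyone who finds this thread in the archive: the precision concern
about subtracting two large sums can be illustrated with a short sketch.
It is not Hadoop code and not anything proposed in the thread. It
compares the quoted one-pass formula (A - N*avg^2)/N with a different,
swapped-in technique: Welford's running update, plus the standard
pairwise formula for combining partial results. The function names
(naive_variance, welford, merge) are hypothetical. The mapping onto
MapReduce is an assumption: each map task would emit one
(count, mean, M2) triple for its split and the reducer would merge them.

```python
import math


def naive_variance(xs):
    """Formula quoted in the thread: (sum(x^2) - N * avg^2) / N."""
    n = len(xs)
    a = sum(x * x for x in xs)   # A = sum of squares
    b = sum(xs)                  # B = sum of inputs
    avg = b / n
    return (a - n * avg * avg) / n


def welford(xs):
    """One pass over xs; returns (count, mean, M2), M2 = sum((x - mean)^2)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2


def merge(p, q):
    """Combine two partial (count, mean, M2) triples into one."""
    n_a, mean_a, m2_a = p
    n_b, mean_b, m2_b = q
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


if __name__ == "__main__":
    small = [1.0, 2.0, 3.0, 4.0]           # the example from the thread
    big = [1e9 + x for x in small]         # same spread, large mean

    for name, xs in (("small", small), ("big", big)):
        n, _, m2 = merge(welford(xs[:2]), welford(xs[2:]))  # two "splits"
        print(name, "naive:", naive_variance(xs), "merged:", m2 / n)
```

On the thread's own example (1, 2, 3, 4) both approaches agree. Once the
inputs sit far from zero relative to their spread, the naive formula
loses the answer in double precision, while the merged triples keep it.
How bad the loss is on real data depends on the ratio of mean to
standard deviation, so treat this as a demonstration of the effect
rather than a statement about any particular data set.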