Mailing-List: contact core-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of welling@psc.edu designates
 128.182.58.100 as permitted sender)
Subject: Re: reading input for a map function from 2 different files?
From: Joel Welling <welling@psc.edu>
Reply-To: welling@psc.edu
To: core-user@hadoop.apache.org
Cc: welling@psc.edu
In-Reply-To: <491A7F79.3080302@yahoo-inc.com>
References: <4abea26c0811092110qc5cb25by14ef0e6b02b855dd@mail.gmail.com>
	 <4917DA88.2060101@yahoo-inc.com> <4917FD54.50502@yahoo-inc.com>
	 <4abea26c0811112025p2822d558kdf3a5f01a2c8b2ea@mail.gmail.com>
	 <491A7F79.3080302@yahoo-inc.com>
Content-Type: text/plain
Organization: Pittsburgh Supercomputing Center
Date: Wed, 12 Nov 2008 14:46:12 -0500
Message-Id: <1226519172.17978.26.camel@welling-laptop>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit

Amar, isn't there a problem with your method in that it gets a small
result by subtracting very large numbers?  Given a million inputs, won't
A and B be so much larger than the standard deviation that there aren't
enough bits left in the floating-point number to represent it?

I just thought I should mention that, before this thread goes in an
archive somewhere and some student looks it up.
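
To make the concern concrete, here is a small Python sketch (plain
Python, not Hadoop code). The data set is an assumption chosen to
provoke the problem: a million values near 1e9 that differ by at most 1,
so the true variance is exactly 0.25. It compares the one-pass formula
from the quoted message with a two-pass computation that subtracts the
mean before squaring.

```python
# Assumed test data: one million values near 1e9 that differ by at most 1.
N = 1_000_000
data = [1e9 + (i % 2) for i in range(N)]   # true variance is exactly 0.25

# One-pass formula from the quoted message: (A - N * avg^2) / N
a = 0.0   # running sum of squares (A)
b = 0.0   # running sum of inputs (B)
for x in data:
    a += x * x
    b += x
avg = b / N
naive_var = (a - N * avg * avg) / N

# Two-pass formula: subtract the mean first, then square the small residuals.
two_pass_var = sum((x - avg) ** 2 for x in data) / N

print("one-pass variance:", naive_var)
print("two-pass variance:", two_pass_var)
```

With 64-bit doubles, A is about 1e24 here, where adjacent representable
values are roughly 1.3e8 apart, so the difference A - N*avg^2 (which
should be 250000) cannot be resolved. The one-pass result can even come
out negative, which is why the sketch prints variances rather than
taking a square root.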

-Joel

On Wed, 2008-11-12 at 12:32 +0530, Amar Kamat wrote:
> some speed wrote:
> > Thanks for the response. What I am trying to do is find the average
> > and then the standard deviation of a very large set (say a million) of
> > numbers. The result would be used in further calculations.
> > I have got the average from the first map-reduce chain. Now I need to read
> > this average as well as the set of numbers to calculate the standard
> > deviation, so one file would have the input set and the other "resultant"
> > file would have just the average.
> > Please do tell me in case there is a better way of doing things than what I
> > am doing. Any input/suggestion is appreciated. :)
> >
> >   
> std_dev^2 = sum_i((Xi - Xa) ^ 2) / N, where Xa is the avg.
> Why don't you use the expanded formula to compute it in one MR job?
> std_dev^2 = (sum_i(Xi ^ 2) - N * (Xa ^ 2)) / N
>           = (A - N * (avg ^ 2)) / N, where A = sum_i(Xi ^ 2)
> 
> For this your map would look like
>    // imagine your input as (k,v) = (Xi, null)
>    map (key, val) : output.collect(key^2, key);
> Reduce should simply sum over the keys to find 'sum_i(Xi ^ 2)' and
> sum over the values to find 'sum_i(Xi)', which gives 'Xa' once divided
> by N. You could use the close() API to finally dump these 2 values to
> a file.
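
The map and reduce steps quoted above can be sketched in plain Python as
an in-memory stand-in. This is not Hadoop API code: map_fn, reduce_all
and std_dev are hypothetical names, and counting N inside the reducer is
an assumption, since the quoted message does not say where N comes from.

```python
# Hypothetical in-memory stand-in for the one-job approach; not Hadoop API code.
from math import sqrt

def map_fn(x):
    """Emit (key, value) = (x^2, x), as in the quoted map sketch."""
    return (x * x, x)

def reduce_all(pairs):
    """Keep running totals over every key and value, plus a record count."""
    a = b = n = 0
    for key, value in pairs:
        a += key      # A = sum of squared inputs
        b += value    # B = sum of inputs
        n += 1        # N is counted here (an assumption)
    return a, b, n

def std_dev(a, b, n):
    """Offline step: sqrt((A - N * avg^2) / N) with avg = B / N."""
    avg = b / n
    return sqrt((a - n * avg * avg) / n)

splits = [[1, 2], [3, 4]]   # two map inputs
pairs = [map_fn(x) for split in splits for x in split]
a, b, n = reduce_all(pairs)
result = std_dev(a, b, n)
print(a, b, n, result)
```

For small integers like these the arithmetic is exact, so the sketch
only illustrates the data flow, not the precision question.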
> 
> For example:
> input: 1,2,3,4
> Say the input is split into 2 groups, [1,2] and [3,4]
> Now there will be 2 maps with output as follows
> map1 output : (1,1) (4,2)
> map2 output : (9,3) (16,4)
> 
> The reducer will maintain the sum over all keys and all values:
> A = sum(keys, i.e. inputs squared) = 1 + 4 + 9 + 16 = 30
> B = sum(values, i.e. inputs) = 1 + 2 + 3 + 4 = 10
> 
> With A and B you can compute the standard deviation offline.
> So avg = B / N = 10/4 = 2.5
> Hence the std deviation would be
> sqrt( (A - N * avg^2) / N) = sqrt ((30 - 4*6.25)/4) = 1.11803399
> 
> Using the main formula the answer is also 1.11803399
> Amar
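
One way to keep a single job while avoiding the subtraction of two huge
sums is to have each split produce a summary (n, mean, M2), where M2 is
the sum of squared deviations from that split's own mean, and to merge
the summaries in the reducer. This swaps in a different technique from
the one quoted above: the pairwise update usually attributed to Chan,
Golub and LeVeque. The sketch below is plain Python rather than Hadoop
code, the function names are hypothetical, and it assumes every split is
non-empty and small enough to summarize in memory.

```python
from math import sqrt

def summarize(values):
    """Per-split summary (n, mean, M2), where M2 = sum((x - mean)^2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values)
    return n, mean, m2

def merge(left, right):
    """Combine two summaries without ever forming a sum of raw squares."""
    n_a, mean_a, m2_a = left
    n_b, mean_b, m2_b = right
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

# Same splits as the worked example: [1,2] and [3,4].
n, mean, m2 = merge(summarize([1, 2]), summarize([3, 4]))
std = sqrt(m2 / n)
print(n, mean, std)
```

On the example splits this gives the same mean (2.5) and standard
deviation as the worked example. Because merge takes two summaries and
returns a summary, it can be applied in any grouping, which is what a
combiner or reducer would need.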
> >
> > On Mon, Nov 10, 2008 at 4:22 AM, Amar Kamat <amarrk@yahoo-inc.com> wrote:
> >
> >   
> >> Amar Kamat wrote:
> >>
> >>     
> >>> some speed wrote:
> >>>
> >>>       
> >>>> I was wondering if it was possible to read the input for a map function
> >>>> from
> >>>> 2 different files:
> >>>>  1st file ---> user-input file from a particular location(path)
> >>>>
> >>>>         
> >>> Is the input/user file sorted? If yes then you can use "map-side join" for
> >>> performance reasons. See org.apache.hadoop.mapred.join for more details.
> >>
> >>     
> >>>> 2nd file ---> A resultant file (has just one <key,value> pair) from a
> >>>> previous MapReduce job. (I am implementing a chain MapReduce function)
> >>>>
> >>>>         
> >>> Can you explain in more detail the contents of 2nd file?
> >>>       
> >>>> Now, for every <key,value> pair in the user-input file, I would like to
> >>>> use
> >>>> the same <key,value> pair from the 2nd file for some calculations.
> >>>>
> >>>>         
> >>> Can you explain this in more detail? Can you give some abstracted example
> >>> of what file1 and file2 look like and what operation/processing you want to
> >>> do?
> >>
> >>
> >>     
> >>> I guess you might need to do some kind of join on the 2 files. Look at
> >>> contrib/data_join for more details.
> >>> Amar
> >>>
> >>>       
> >>>> Is it possible for me to do so? Can someone guide me in the right
> >>>> direction
> >>>> please?
> >>>>
> >>>>
> >>>> Thanks!
> >>>>
> >>>>
> >>>>
> >>>>         
> >>>       
> >
> >