From Joel Welling <well...@psc.edu>
Subject Re: reading input for a map function from 2 different files?
Date Wed, 12 Nov 2008 19:46:12 GMT
```
Amar, isn't there a problem with your method in that it gets a small
result by subtracting very large numbers?  Given a million inputs, won't
A and B be so much larger than the standard deviation that there aren't
enough bits left in the floating point number to represent it?

I just thought I should mention that, before this thread goes in an
archive somewhere and some student looks it up.

-Joel

On Wed, 2008-11-12 at 12:32 +0530, Amar Kamat wrote:
> some speed wrote:
> > Thanks for the response. What I am trying to do is find the average
> > and then the standard deviation for a very large set (say a million) of
> > numbers. The result would be used in further calculations.
> > I have got the average from the first map-reduce chain. Now I need to read
> > this average as well as the set of numbers to calculate the standard
> > deviation, so one file would have the input set and the other "resultant"
> > file would have just the average.
> > Please do tell me in case there is a better way of doing things than what I
> > am doing. Any input/suggestion is appreciated. :)
> >
> >
> std_dev^2 = sum_i((Xi - Xa) ^ 2) / N; where Xa is the avg.
> Why don't you use the formula below to compute it in one MR job?
> std_dev^2 = (sum_i(Xi ^ 2)  - N * (Xa ^ 2) ) / N;
>                  = (A - N*(avg^2))/N; where A = sum_i(Xi ^ 2) and avg = Xa
>
> For this your map would look like
>    map (key, val) : output.collect(key^2, key); // imagine your input as
> (k,v) = (Xi, null)
> Reduce should simply sum over the keys to find out 'sum_i(Xi ^ 2)' and
> sum over the values to find out 'sum_i(Xi)', which gives 'Xa' once divided
> by N. You could use the close() api to finally dump these 2 values to a file.
>
> For example :
> input : 1,2,3,4
> Say input is split in 2 groups [1,2] and [3,4]
> Now there will be 2 maps with output as follows
> map1 output : (1,1) (4,2)
> map2 output : (9,3) (16,4)
>
> Reducer will maintain the sum over all keys and all values
> A = sum(key i.e. input squared) = 1 + 4 + 9 + 16 = 30
> B = sum(values i.e input) = 1 + 2 + 3 + 4 = 10
>
> With A and B you can compute the standard deviation offline.
> So avg = B / N = 10/4 = 2.5
> Hence the std deviation would be
> sqrt( (A - N * avg^2) / N) = sqrt ((30 - 4*6.25)/4) = 1.11803399
>
> Using the main formula the answer is 1.11803399
> Amar
> >
> > On Mon, Nov 10, 2008 at 4:22 AM, Amar Kamat <amarrk@yahoo-inc.com> wrote:
> >
> >
> >> Amar Kamat wrote:
> >>
> >>
> >>> some speed wrote:
> >>>
> >>>
> >>>> I was wondering if it was possible to read the input for a map function
> >>>> from 2 different files:
> >>>>  1st file ---> user-input file from a particular location(path)
> >>>>
> >>>>
> >>> Is the input/user file sorted? If yes then you can use "map-side join" for
> >>> performance reasons. See org.apache.hadoop.mapred.join for more details.
> >>
> >>
> >>>>  2nd file ---> A resultant file (has just one <key,value> pair) from a
> >>>
> >>>> previous MapReduce job. (I am implementing a chain MapReduce function)
> >>>>
> >>>>
> >>> Can you explain in more detail the contents of 2nd file?
> >>>
> >>>> Now, for every <key,value> pair in the user-input file, I would like to use
> >>>> the same <key,value> pair from the 2nd file for some calculations.
> >>>>
> >>>>
> >>> Can you explain this in more detail? Can you give some abstracted example
> >>> of what file1 and file2 look like and what operation/processing you want to
> >>> do?
> >>
> >>
> >>
> >>> I guess you might need to do some kind of join on the 2 files. Look at
> >>> contrib/data_join for more details.
> >>> Amar
> >>>
> >>>
> >>>> Is it possible for me to do so? Can someone guide me in the right
> >>>> direction?
> >>>>
> >>>>
> >>>> Thanks!
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >
> >

```
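
The exchange above can be tried out on a small scale. The sketch below is not from the thread: it is a minimal Python illustration, with hypothetical function names, that first reproduces Amar's sum / sum-of-squares computation (N, A and B per split, added up in a "reduce" step) and then compares it with a mergeable alternative, assuming each split instead keeps a count, a running mean and a sum of squared deviations (Welford's update, combined with the pairwise merge usually attributed to Chan et al.). The second data set is the first kind of input Joel worries about: values that are large compared with their spread.

```python
import functools
import math

def naive_partial(chunk):
    """Per-split partial as in the thread: (N, B = sum of x, A = sum of x^2)."""
    n, b, a = 0, 0.0, 0.0
    for x in chunk:
        n += 1
        b += x
        a += x * x
    return n, b, a

def naive_variance(partials):
    """Add the partials component-wise, then apply (A - N * avg^2) / N."""
    n = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    a = sum(p[2] for p in partials)
    avg = b / n
    return (a - n * avg * avg) / n

def stable_partial(chunk):
    """Welford's update: (N, mean, M2 = sum of squared deviations from the mean)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in chunk:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2

def merge(p, q):
    """Combine two (N, mean, M2) partials without revisiting the data."""
    (na, mean_a, m2a), (nb, mean_b, m2b) = p, q
    if na == 0:
        return q
    if nb == 0:
        return p
    n = na + nb
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

def stable_variance(partials):
    """Population variance from merged (N, mean, M2) partials."""
    n, _, m2 = functools.reduce(merge, partials)
    return m2 / n

# Amar's worked example, split as [1,2] and [3,4].
small = [[1.0, 2.0], [3.0, 4.0]]
naive_small = math.sqrt(naive_variance([naive_partial(c) for c in small]))
stable_small = math.sqrt(stable_variance([stable_partial(c) for c in small]))

# Same shape of problem, but the values are huge relative to their spread.
# The exact population variance of these four numbers is 22.5.
shifted = [[1e9 + 4, 1e9 + 7], [1e9 + 13, 1e9 + 16]]
naive_shifted = naive_variance([naive_partial(c) for c in shifted])
stable_shifted = stable_variance([stable_partial(c) for c in shifted])

print("small:  ", naive_small, stable_small)
print("shifted:", naive_shifted, stable_shifted)
```

On the small input both approaches agree with the 1.11803399 worked out in the thread. On the shifted input the sum-of-squares variance is computed as the difference of two numbers near 4e18, where adjacent doubles are hundreds apart, so it cannot land on 22.5 and may even come out negative (which is why the sketch does not take its square root). The merged partials do not subtract large totals, and a reducer could combine them in any order.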
