hadoop-common-user mailing list archives

From "Miles Osborne" <mi...@inf.ed.ac.uk>
Subject Re: reading input for a map function from 2 different files?
Date Wed, 12 Nov 2008 19:53:41 GMT
unless you really care about getting exact averages etc, i would
suggest simply sampling the input and computing your statistics from
that

--it will be a lot faster and you won't have to deal with under/overflow etc

if your sample is reasonably large then your results will be pretty
close to the true values

Miles
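
A minimal sketch of the sampling idea above, assuming the numbers fit in a Python list and that each record is kept independently with a fixed probability. The function name `sampled_mean_std` and the `rate` and `seed` parameters are illustrative, not part of Hadoop; in a real job the same coin flip could be done inside the mapper.

```python
import math
import random


def sampled_mean_std(values, rate, seed=0):
    """Estimate the mean and population std dev from a random subset of values."""
    rng = random.Random(seed)
    # Keep each record independently with probability `rate`.
    sample = [x for x in values if rng.random() < rate]
    if not sample:
        raise ValueError("empty sample; increase the sampling rate")
    mean = sum(sample) / len(sample)
    # Two-pass variance on the small sample avoids subtracting huge numbers.
    var = sum((x - mean) ** 2 for x in sample) / len(sample)
    return mean, math.sqrt(var), len(sample)


if __name__ == "__main__":
    gen = random.Random(42)
    data = [gen.gauss(50.0, 10.0) for _ in range(100_000)]  # synthetic input
    est_mean, est_std, n = sampled_mean_std(data, rate=0.05)
    print(n, est_mean, est_std)
```

The estimate's error shrinks roughly with the square root of the sample size, so a few thousand records are often enough when exact values are not needed.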

2008/11/12 Joel Welling <welling@psc.edu>:
> Amar, isn't there a problem with your method in that it gets a small
> result by subtracting very large numbers?  Given a million inputs, won't
> A and B be so much larger than the standard deviation that there aren't
> enough bits left in the floating point number to represent it?
>
> I just thought I should mention that, before this thread goes in an
> archive somewhere and some student looks it up.
>
> -Joel
>
> On Wed, 2008-11-12 at 12:32 +0530, Amar Kamat wrote:
>> some speed wrote:
>> > Thanks for the response. What I am trying to do is find the average
>> > and then the standard deviation for a very large set (say a million) of
>> > numbers. The result would be used in further calculations.
>> > I have got the average from the first map-reduce chain. now i need to read
>> > this average as well as the set of numbers to calculate the standard
>> > deviation.  so one file would have the input set and the other "resultant"
>> > file would have just the average.
>> > Please do tell me in case there is a better way of doing things than what i
>> > am doing. Any input/suggestion is appreciated.:)
>> >
>> >
>> std_dev^2 = sum_i((Xi - Xa) ^ 2) / N; where Xa is the avg.
>> Why don't you use the expanded formula to compute it in one MR job?
>> std_dev^2 = (sum_i(Xi ^ 2)  - N * (Xa ^ 2) ) / N
>>           = (A - N*(avg^2))/N, where A = sum_i(Xi ^ 2) and avg = Xa
>>
>> For this your map would look like
>>    map (key, val) : output.collect(key^2, key); // imagine your input as
>> (k,v) = (Xi, null)
>> Reduce should simply sum over the keys to find 'sum_i(Xi ^ 2)' and
>> sum over the values to find 'sum_i(Xi)', which gives Xa once divided
>> by N. You could use the close() api to finally dump these 2 values to
>> a file.
>>
>> For example :
>> input : 1,2,3,4
>> Say input is split in 2 groups [1,2] and [3,4]
>> Now there will be 2 maps with output as follows
>> map1 output : (1,1) (4,2)
>> map2 output : (9,3) (16,4)
>>
>> Reducer will maintain the sum over all keys and all values
>> A = sum(key i.e  input squared) = 1+ 4 + 9 + 16 = 30
>> B = sum(values i.e input) = 1 + 2 + 3 + 4 = 10
>>
>> With A and B you can compute the standard deviation offline.
>> So avg = B / N = 10/4 = 2.5
>> Hence the std deviation would be
>> sqrt( (A - N * avg^2) / N) = sqrt ((30 - 4*6.25)/4) = 1.11803399
>>
>> Using the main formula the answer is 1.11803399
>> Amar
>> >
>> > On Mon, Nov 10, 2008 at 4:22 AM, Amar Kamat <amarrk@yahoo-inc.com> wrote:
>> >
>> >
>> >> Amar Kamat wrote:
>> >>
>> >>
>> >>> some speed wrote:
>> >>>
>> >>>
>> >>>> I was wondering if it was possible to read the input for a map function
>> >>>> from
>> >>>> 2 different files:
>> >>>>  1st file ---> user-input file from a particular location(path)
>> >>>>
>> >>>>
>> >>> Is the input/user file sorted? If yes then you can use "map-side join" for
>> >>>
>> >> performance reasons. See org.apache.hadoop.mapred.join for more details.
>> >>
>> >>
>> >>> 2nd file ---> A resultant file (has just one <key,value> pair) from a
>> >>>
>> >>>> previous MapReduce job. (I am implementing a chain MapReduce function)
>> >>>>
>> >>>>
>> >>> Can you explain in more detail the contents of 2nd file?
>> >>>
>> >>>> Now, for every <key,value> pair in the user-input file, I would like to
>> >>>> use
>> >>>> the same <key,value> pair from the 2nd file for some calculations.
>> >>>>
>> >>>>
>> >>> Can you explain this in more detail? Can you give some abstracted example
>> >>>
>> >> of how file1 and file2 look like and what operation/processing you want to
>> >> do?
>> >>
>> >>
>> >>
>> >>> I guess you might need to do some kind of join on the 2 files. Look at
>> >>> contrib/data_join for more details.
>> >>> Amar
>> >>>
>> >>>
>> >>>> Is it possible for me to do so? Can someone guide me in the right
>> >>>> direction
>> >>>> please?
>> >>>>
>> >>>>
>> >>>> Thanks!
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>
>> >
>> >
>
>
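
The one-job approach quoted above can be sketched in plain Python, without Hadoop, to check the worked example for the input 1,2,3,4. The helper names `map_split`, `reduce_sums` and `std_from_sums` are hypothetical, and the reducer here also counts the records, which the quoted description assumes is already known as N.

```python
import math


def map_split(split):
    """Mapper: emit (x^2, x) for every input x, following the quoted pseudocode."""
    return [(x * x, x) for x in split]


def reduce_sums(pairs):
    """Reducer: A = sum of the keys (squares), B = sum of the values, N = count."""
    a = b = n = 0
    for square, x in pairs:
        a += square
        b += x
        n += 1
    return a, b, n


def std_from_sums(a, b, n):
    """Population standard deviation from A, B and N: sqrt((A - N*avg^2) / N)."""
    avg = b / n
    return math.sqrt((a - n * avg * avg) / n)


if __name__ == "__main__":
    splits = [[1, 2], [3, 4]]  # two map tasks
    pairs = [p for s in splits for p in map_split(s)]
    a, b, n = reduce_sums(pairs)
    print(a, b, n, std_from_sums(a, b, n))
```

For these four small integers the sums are exact (A = 30, B = 10), so the result agrees with the 1.11803399 quoted in the thread.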

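On the cancellation concern raised in the quoted text: one commonly used alternative, not proposed in this thread, is to have each map task keep a running (count, mean, M2) triple with Welford's update and to merge the triples in the reducer with the pairwise formula of Chan et al. The sketch below is a plain-Python illustration under that assumption; all function names are hypothetical.

```python
import math


def welford(values):
    """Running (count, mean, M2) for one split; M2 is the sum of squared deviations."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2


def merge(left, right):
    """Combine two (count, mean, M2) triples without forming large sums of squares."""
    n_a, mean_a, m2_a = left
    n_b, mean_b, m2_b = right
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def naive_variance(values):
    """The (A - N*avg^2)/N formula; it can lose precision when the mean is large."""
    n = len(values)
    a = sum(x * x for x in values)
    avg = sum(values) / n
    return (a - n * avg * avg) / n


if __name__ == "__main__":
    offset = 1e9  # large mean, small spread
    splits = [[offset + 4, offset + 7], [offset + 13, offset + 16]]
    n, mean, m2 = merge(welford(splits[0]), welford(splits[1]))
    print("merged std:", math.sqrt(m2 / n))
    print("naive variance:", naive_variance(splits[0] + splits[1]))
```

The true population variance of this input is 22.5, since the offset does not change the spread. The merged triple tracks deviations from the mean, so it never has to subtract two numbers near 1e18 the way the naive formula does.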


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
