Hi Rohit, thank you for your reply.
As for the second assumption, could you explain a bit more why it is
not right?
Thank you.
On Tue, Apr 3, 2012 at 12:50 PM, Rohit Kelkar <rohitkelkar@gmail.com> wrote:
> Your idea in the first paragraph is correct. To speed things up you
> can also explore the possibility of using a Combiner. For example, for
> computing the sum, set the combiner to be the same class as your
> reducer. For calculating variance, write a combiner class that would
> output (xi - mu)^2; in the reducer code you could sum these and divide
> by the count (and take the sqrt if you want the standard deviation).
>
> Your second assumption, that the number of reducers = the number of
> variables, is not right.
>
> - Rohit Kelkar
>
> On Tue, Apr 3, 2012 at 10:10 AM, Fang Xin <nusfangxin@gmail.com> wrote:
>> Hi,
>>
>> I have a spreadsheet where each column contains values for one
>> variable, and I need to calculate the sum, variance, etc. for each
>> column. As I understand it, the mapper and reducer work on
>> <key, value> pairs. Can anyone kindly enlighten me on how to abstract
>> this problem?
>>
>> Maybe the mapper could read each line and emit the variable
>> name/number as the "key" and the corresponding value as the "value".
>> Then, when all pairs with the same "key" (i.e. they belong to the same
>> variable) are passed to a reducer, the reducer can do the calculation
>> and output to a file.
>> Is this idea correct? Can anyone kindly give some comments?
>>
>> Besides, in this method, the number of reducers will be determined by
>> the number of variables I have.
>> What happens if the number of variables is small and, for each
>> variable, the number of entries is far larger than the total number
>> of variables? Then the execution time for each reducer can be
>> comparatively long.
>> Is there any way to make use of more hardware resources and create
>> more reducers to run in parallel?
>>
>> Best regards,
>> Xin
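
For anyone reading this thread later, the ideas discussed above can be
sketched in plain Python. This is not Hadoop code, and every function
name below is made up for illustration. Instead of emitting (xi - mu)^2,
which needs mu in advance, the sketch swaps in a different technique:
each partial result is a (count, sum, sum of squares) triple, so the
same merge step can serve as both the combiner and the core of the
reducer. It also illustrates why the number of reducers need not equal
the number of variables: keys are assigned to reducers by a partition
function (here simply key % num_reducers, an assumption standing in for
a hash partitioner), so one reducer may handle several columns.

```python
from collections import defaultdict


def map_row(row):
    """Mapper: emit (column_index, (count, sum, sum_of_squares)) per cell."""
    for col, x in enumerate(row):
        yield col, (1, x, x * x)


def merge(a, b):
    """Add two partial triples component-wise (associative and commutative)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def combine(pairs):
    """Combiner: collapse one mapper's output to a single triple per column."""
    acc = {}
    for key, triple in pairs:
        acc[key] = merge(acc[key], triple) if key in acc else triple
    return acc


def reduce_column(triples):
    """Reducer: merge all partial triples for one column, then finish."""
    n, s, ss = 0, 0.0, 0.0
    for t in triples:
        n, s, ss = merge((n, s, ss), t)
    mean = s / n
    variance = ss / n - mean * mean  # population variance, E[x^2] - E[x]^2
    return s, variance


def run_job(splits, num_reducers=2):
    """Simulate the job: each split is a list of rows handled by one mapper."""
    # One bucket of {key: [triples]} per simulated reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        pairs = (kv for row in split for kv in map_row(row))
        for key, triple in combine(pairs).items():
            # Assumed partition function: column index modulo reducer count.
            partitions[key % num_reducers][key].append(triple)
    results = {}
    for part in partitions:
        for key, triples in part.items():
            results[key] = reduce_column(triples)
    return results
```

With a combiner like this, each reducer receives only one small triple
per column per input split, so reducer time should stay short even when
a column has far more entries than there are columns. Note that the
E[x^2] - E[x]^2 formula can lose precision when values are large
relative to their spread, so treat it as a sketch rather than a
numerically robust method.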
