Your idea in the first paragraph is correct. To speed things up you can
also explore using a Combiner. For example, for computing the sum, set
the combiner to be the same class as your reducer. For calculating
variance, write a combiner class that outputs partial sums of
(xi - mu)^2 (with mu known from a previous pass); in the reducer code
you divide the total by the count, and taking the sqrt of that gives
the standard deviation.
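
A rough sketch of the combiner idea, in plain Python rather than the
Hadoop API. It swaps in a different formulation from the (xi - mu)^2
one: each column carries a (count, sum, sum of squares) triple, so no
earlier pass for mu is needed. The function names and the two sample
splits are made up for illustration, and the sum-of-squares formula can
lose precision on large values, so treat it as a starting point only.

```python
def map_row(row):
    """Emit (column_index, (count, sum, sum_of_squares)) for each cell."""
    for col, x in enumerate(row):
        yield col, (1, x, x * x)


def merge(a, b):
    """Add two partial triples component-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def combine(pairs):
    """Merge all triples that share a key. Because merging is
    associative, the same function can serve as combiner and reducer."""
    partial = {}
    for key, stats in pairs:
        partial[key] = merge(partial[key], stats) if key in partial else stats
    return partial


def finish(stats):
    """Turn a merged triple into (sum, mean, population variance)."""
    n, s, ss = stats
    mean = s / n
    return s, mean, ss / n - mean * mean


# Two hypothetical input splits, each handled by its own map task.
split_a = [[1.0, 10.0], [2.0, 20.0]]
split_b = [[3.0, 30.0], [4.0, 40.0]]

# Combiner step: one small dict of partial triples per split.
combined = [
    combine(pair for row in split for pair in map_row(row))
    for split in (split_a, split_b)
]

# Reducer step: merge the per-split partials, then finish each column.
totals = combine(item for part in combined for item in part.items())
results = {col: finish(stats) for col, stats in totals.items()}
```

The point is that each reducer only receives one small triple per map
task instead of every raw value for its column.
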
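Your second assumption, that the number of reducers equals the number
of variables, is not right. The number of reduce tasks is set in the
job configuration, and the partitioner then distributes the keys
across them.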
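
To illustrate why the two numbers are independent, here is a small
plain-Python simulation (not Hadoop code) of hash partitioning, where a
key goes to reducer "non-negative hash modulo number of reducers". The
function names are hypothetical, and it relies on Python hashing small
integers to themselves.

```python
def partition(column_index, num_reducers):
    """Stand-in for a hash partitioner: non-negative hash mod reducer count."""
    return (hash(column_index) & 0x7FFFFFFF) % num_reducers


def assign(num_columns, num_reducers):
    """Group column indices by the reducer they would be sent to."""
    assignment = {}
    for col in range(num_columns):
        assignment.setdefault(partition(col, num_reducers), []).append(col)
    return assignment


# 5 variables, 2 reducers: each reducer handles several variables.
few_reducers = assign(5, 2)

# 5 variables, 8 reducers: only 5 reducers receive any data.
many_reducers = assign(5, 8)
```

All values for one key still end up at a single reducer, so adding
reducers beyond the number of keys leaves the extra ones idle. Cutting
down what each reducer receives, for instance with a combiner, is the
more useful lever in that case.
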
- Rohit Kelkar
On Tue, Apr 3, 2012 at 10:10 AM, Fang Xin <nusfangxin@gmail.com> wrote:
> Hi,
>
> I have a spreadsheet where each column contains values for one
> variable, and I need to calculate the sum, variance, etc. for each
> column. As I understand it, the mapper and reducer work on
> <key, value> pairs. Can anyone kindly enlighten me on how to abstract
> this problem?
>
> Maybe the mapper could read each line and set the variable name/number
> as the "key" and the corresponding value as the "value".
> Then, when all pairs with the same "key" (i.e. they belong to the same
> variable) are passed to a reducer, the reducer can do the calculation
> and output to a file.
> Is this idea correct? Can anyone kindly give some comments?
>
> Besides, in this method, the number of reducers will be determined by
> the number of variables I have.
> What happens if the number of variables is limited and, for each
> variable, the number of entries is far bigger than the total number of
> variables? Then the execution time for each reducer can be
> comparatively long.
> Is there any way to make use of more hardware resources and create
> more reducers to run in parallel?
>
> Best regards,
> Xin
