hadoop-mapreduce-user mailing list archives

From Rohit Kelkar <rohitkel...@gmail.com>
Subject Re: How I can create map/reduce for a spreadsheet calculation?
Date Tue, 03 Apr 2012 04:50:45 GMT
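Your idea in the first paragraph is correct. To speed things up, you
can also explore using a Combiner. For example, for computing the sum,
set the combiner to be the same class as your reducer. For calculating
variance, write a combiner class that outputs partial sums of
(xi - mu)^2 together with a count (this assumes the mean mu is already
known, e.g. from an earlier pass); the reducer then adds the partial
sums and divides by the total count. Take the sqrt only if you want
the standard deviation.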
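
To make the combiner idea concrete, here is a minimal sketch in plain
Python that simulates the map, combine and reduce steps without using
the Hadoop API. Note that it swaps in a slightly different technique:
each record is turned into a (count, sum, sum of squares) triple
instead of (xi - mu)^2, so the mean does not have to be known in
advance. All function names and the sample data are made up for
illustration, and the sum-of-squares formula can lose precision when
values are large relative to their spread.

```python
from collections import defaultdict

def map_row(row):
    """Mapper: emit (column_index, (count, sum, sum_of_squares)) per cell."""
    for col, x in enumerate(row):
        yield col, (1, x, x * x)

def merge(partials):
    """Shared by combiner and reducer: add partial triples component-wise."""
    n = s = ss = 0
    for pn, ps, pss in partials:
        n += pn
        s += ps
        ss += pss
    return n, s, ss

def combine(pairs):
    """Combiner: pre-aggregate one mapper's output per key before the shuffle."""
    grouped = defaultdict(list)
    for key, partial in pairs:
        grouped[key].append(partial)
    return [(key, merge(vals)) for key, vals in grouped.items()]

def reduce_column(key, partials):
    """Reducer: merge partials, then derive sum, mean and population variance."""
    n, s, ss = merge(partials)
    mean = s / n
    variance = ss / n - mean * mean
    return key, {"sum": s, "mean": mean, "variance": variance}

def run_job(splits):
    """Simulate the job: map + combine per input split, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for split in splits:
        mapped = [pair for row in split for pair in map_row(row)]
        for key, partial in combine(mapped):
            shuffled[key].append(partial)
    return dict(reduce_column(k, v) for k, v in shuffled.items())

# Hypothetical spreadsheet with two columns, read as two input splits.
splits = [
    [[1.0, 10.0], [2.0, 20.0]],
    [[3.0, 30.0], [4.0, 40.0]],
]
stats = run_job(splits)
```

Because merging triples is associative and commutative, the same
merge logic can serve as both combiner and reducer core, and each
reducer receives one small triple per mapper instead of every raw
value.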

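Your second assumption, that number of reducers = number of variables,
is not right. The number of reduce tasks is a job setting (for example
via job.setNumReduceTasks), and a single reducer handles every key
that is partitioned to it, one key at a time.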
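
As a rough illustration of why the two numbers are independent, the
sketch below assigns keys to reducers by hashing the key modulo the
number of reducers, which is, as far as I know, the spirit of Hadoop's
default hash partitioner. It is plain Python, not Hadoop code; the
column names, the reducer count and the use of CRC32 as the hash are
assumptions chosen for the example.

```python
import zlib

def partition(key, num_reducers):
    """Assign a key to a reducer index by hashing the key (illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Hypothetical variable (column) names.
columns = ["col_a", "col_b", "col_c", "col_d", "col_e"]
num_reducers = 2  # chosen in the job configuration, not derived from the data

# Group keys by the reducer that would receive them.
assignment = {}
for key in columns:
    assignment.setdefault(partition(key, num_reducers), []).append(key)

for reducer, keys in sorted(assignment.items()):
    print("reducer", reducer, "handles", keys)
```

Five variables are spread over at most two reducers here, and all
values for a given variable still arrive at the same reducer.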

- Rohit Kelkar

On Tue, Apr 3, 2012 at 10:10 AM, Fang Xin <nusfangxin@gmail.com> wrote:
> Hi,
>
> I have a spreadsheet where each column contains values for one
> variable, and I need to calculate sum, variance, etc. for each column.
> As I understand it, mappers and reducers work on <key, value> pairs.
> Can anyone kindly enlighten me on how to abstract this problem?
>
> Maybe for the mapper, let it read each line and set the variable
> name/number as "key" and the corresponding value as "value".
> Then, when all pairs with the same "key" (i.e. they belong to the same
> variable) are passed to a reducer, the reducer can do the calculation
> and output to a file.
> Is this idea correct? Can anyone kindly give some comments?
>
> Besides, in this method, the number of reducers will be determined by
> the number of variables I have.
> What happens if the number of variables is limited and, for each
> variable, the number of entries is far bigger than the total number of
> variables? Then the execution time for each reducer can be
> comparatively long.
> Is there any way to make use of more hardware resources and create
> more reducers to run in parallel?
>
> Best regards,
> Xin
