Hi,
I have a spreadsheet where each column contains the values for one
variable, and I need to calculate the sum, variance, etc. for each column.
As I understand it, the mapper and reducer work on <key, value> pairs.
Can anyone kindly enlighten me on how to express this problem that way?
Maybe the mapper could read each line and, for every column in it, emit
the variable name/number as the "key" and the corresponding value as
the "value".
Then, once all pairs with the same "key" (i.e. they belong to the same
variable) are passed to a reducer, the reducer can do the calculation
and write the output to a file.
Is this idea correct? Can anyone kindly give some comments?
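To make the idea concrete, here is a rough sketch in plain Python. It
only simulates the map, group-by-key and reduce steps in memory and is
not real Hadoop code. I am assuming comma-separated rows with no header
line, numeric cells only, the column index as the key, and population
variance; the function names are made up for illustration.

```python
from collections import defaultdict

def mapper(line):
    """Emit one (column index, value) pair per cell of a comma-separated row."""
    for col, cell in enumerate(line.strip().split(",")):
        yield col, float(cell)

def reducer(key, values):
    """Compute the sum and population variance of all values for one column."""
    n = len(values)
    total = sum(values)
    mean = total / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return key, total, variance

def run(lines):
    """Simulate the shuffle: group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return [reducer(key, groups[key]) for key in sorted(groups)]

# Hypothetical three-row, two-column spreadsheet.
rows = ["1,10", "2,20", "3,30"]
for key, total, variance in run(rows):
    print(key, total, variance)
```

In this sketch each reducer call sees every value of its column, which
is exactly what worries me in the second part of my question.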
Besides, with this method, the number of reducers that can run in
parallel will be determined by the number of variables I have, since
there is one key per variable.
What happens if the number of variables is small and, for each
variable, the number of entries is far larger than the total number of
variables? The execution time for each reducer could then be
comparatively long.
Is there any way to make use of more hardware resources and create more
reducers to run in parallel?
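One direction I have been wondering about, though I have not tested it
and it may well be the wrong approach: each chunk of a column could be
summarised on its own as (count, mean, M2), where M2 is the sum of
squared deviations from the chunk mean, and the summaries merged
afterwards. I believe Hadoop's combiner is meant for this kind of
pre-aggregation, but that is an assumption on my part. A minimal sketch
in plain Python, with made-up function names and assuming every chunk is
non-empty:

```python
def partial(values):
    """Summarise one non-empty chunk of a column as (count, mean, M2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2

def merge(a, b):
    """Combine two (count, mean, M2) summaries into one for the union of both chunks."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

def finish(summary):
    """Turn a merged summary into (sum, population variance)."""
    n, mean, m2 = summary
    return n * mean, m2 / n

# Two hypothetical chunks of the same column, summarised separately.
merged = merge(partial([1.0, 2.0]), partial([3.0, 4.0, 5.0]))
print(finish(merged))
```

If this is valid, the heavy work would happen in parallel on the chunks
and the final reducer for each variable would only merge a handful of
small summaries.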
Best regards,
Xin
