hadoop-mapreduce-user mailing list archives

From Rohit Kelkar <rohitkel...@gmail.com>
Subject Re: How I can create map/reduce for a spreadsheet calculation?
Date Tue, 03 Apr 2012 05:13:00 GMT
Sorry, I misread your second question earlier.
In your context, one variable = one column from your spreadsheet (am I right?).
In that case, you can call job.setNumReduceTasks(N), where N is the number of
columns in your spreadsheet.
Note that if you go to
http://<hadoop master's ip address>:50030/jobtracker.jsp
and look at the cluster summary, you will see a number for the "reduce task
capacity". This is the maximum number of reducers that can run concurrently
on your cluster. No matter what value of N you specify, the number of
reducers running at any one time is capped by this "reduce task capacity".
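
To see how N interacts with the column keys: Hadoop's default HashPartitioner
sends a key to partition (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
The plain-Java sketch below mirrors that formula without depending on Hadoop.
The class and method names are hypothetical, and it assumes the mapper emits
the zero-based column index as an integer key (as with an IntWritable, whose
hash code is its int value).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: it only reproduces the arithmetic
// used by the default HashPartitioner so the assignment can be inspected.
final class PartitionSketch {

    // Same formula as HashPartitioner.getPartition: clear the sign bit,
    // then take the remainder modulo the number of reduce tasks.
    static int partitionFor(int keyHash, int numReduceTasks) {
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups column indexes 0..numColumns-1 by the reducer that would get them.
    // Assumption: the map output key is the column index as an integer.
    static Map<Integer, List<Integer>> assign(int numColumns, int numReduceTasks) {
        Map<Integer, List<Integer>> byReducer = new TreeMap<>();
        for (int col = 0; col < numColumns; col++) {
            int p = partitionFor(Integer.hashCode(col), numReduceTasks);
            byReducer.computeIfAbsent(p, k -> new ArrayList<>()).add(col);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        System.out.println(assign(5, 5)); // N equal to the number of columns
        System.out.println(assign(5, 2)); // N smaller than the number of columns
    }
}
```

With five columns, N = 5 places each column on its own reducer, while N = 2
groups columns 0, 2, 4 on one reducer and 1, 3 on the other. Under this keying,
all values of one column always reach the same reducer, so choosing N larger
than the number of columns would leave the extra reducers without input.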

- Rohit Kelkar

On Tue, Apr 3, 2012 at 10:25 AM, Fang Xin <nusfangxin@gmail.com> wrote:
> Hi Rohit, thank you for your reply.
> As for the second assumption, could you kindly further enlighten me a
> bit, please?
> Thank you.
> On Tue, Apr 3, 2012 at 12:50 PM, Rohit Kelkar <rohitkelkar@gmail.com> wrote:
>> Your idea in the first paragraph is correct. To speed things up, you can
>> also explore the possibility of using a Combiner. For example, for
>> computing the sum, set the combiner to be the same class as your
>> reducer. For calculating variance, write a combiner class that outputs
>> partial aggregates (count, sum, and sum of squares), since mu is not
>> known until all values have been seen; the reducer can then compute the
>> variance, and take the sqrt if you want the standard deviation.
>> Your second assumption, that the number of reducers = the number of
>> variables, is not right.
>> - Rohit Kelkar
>> On Tue, Apr 3, 2012 at 10:10 AM, Fang Xin <nusfangxin@gmail.com> wrote:
>>> Hi,
>>> I have a spreadsheet where each column contains values for one
>>> variable, and I need to calculate the sum, variance, etc. for each column.
>>> As I understand it, the mapper and reducer work on <key, value> pairs.
>>> Can anyone kindly enlighten me on how to abstract this problem?
>>> Maybe the mapper could read each line and emit the variable name/number
>>> as the "key" and the corresponding value as the "value".
>>> Then, when all pairs with the same "key" (i.e. they belong to the same
>>> variable) are passed to a reducer, the reducer can do the calculation and
>>> write the output to a file.
>>> Is this idea correct? Can anyone kindly comment?
>>> Besides, with this method, the number of reducers will be determined by
>>> the number of variables I have.
>>> What happens if the number of variables is small and, for each variable,
>>> the number of entries is far bigger than the total number of variables?
>>> Then the execution time for each reducer can be comparatively long.
>>> Is there any way to make use of more hardware resources and create more
>>> reducers to run in parallel?
>>> Best regards,
>>> Xin
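
A note on the combiner idea discussed in the thread: Hadoop may run a combiner
zero, one, or several times on partial groups of values, so whatever it emits
has to be mergeable. One common way to get a per-column variance under that
constraint is to carry (count, sum, sum of squares) and merge those. The
plain-Java sketch below shows only the arithmetic; ColumnStats is a
hypothetical class, not a Hadoop type, and in a real job it would need to be
a Writable used as the map output value.

```java
// Hypothetical value type, not a Hadoop class: partial aggregates for one
// spreadsheet column that can be merged in any order.
final class ColumnStats {
    final long count;
    final double sum;
    final double sumSq;

    ColumnStats(long count, double sum, double sumSq) {
        this.count = count;
        this.sum = sum;
        this.sumSq = sumSq;
    }

    // What a mapper would emit for a single cell value.
    static ColumnStats of(double x) {
        return new ColumnStats(1, x, x * x);
    }

    // What a combiner or reducer would do for two partial results.
    ColumnStats merge(ColumnStats other) {
        return new ColumnStats(count + other.count, sum + other.sum, sumSq + other.sumSq);
    }

    double mean() {
        return sum / count;
    }

    // Population variance: E[x^2] - (E[x])^2.
    double variance() {
        double m = mean();
        return sumSq / count - m * m;
    }

    double stdDev() {
        return Math.sqrt(variance());
    }

    // Folds several raw values into one partial result, as a combiner might.
    static ColumnStats combine(double... values) {
        ColumnStats acc = new ColumnStats(0, 0.0, 0.0);
        for (double v : values) {
            acc = acc.merge(of(v));
        }
        return acc;
    }

    public static void main(String[] args) {
        ColumnStats a = combine(1.0, 2.0);   // partial result from one map task
        ColumnStats b = combine(3.0, 4.0);   // partial result from another
        ColumnStats all = a.merge(b);        // reducer-side merge
        System.out.println("sum=" + all.sum + " variance=" + all.variance());
    }
}
```

This computes the population variance; the sum-of-squares form can lose
precision when values are large relative to their spread, so a more careful
merge formula may be preferable for such data.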
