hadoop-mapreduce-user mailing list archives

From madhu phatak <phatak....@gmail.com>
Subject Re: How I can create map/reduce for a spreadsheet calculation?
Date Tue, 03 Apr 2012 05:14:19 GMT
Hi,
 You can refer to the following code to calculate sigma x (the sum of a column):

 Mappers
  Extracting a specific column -
https://github.com/zinnia-phatak-dev/Nectar/blob/master/Nectar-common/src/main/java/com/zinnia/nectar/util/hadoop/FieldSeperator.java

 Sum Mapper -
https://github.com/zinnia-phatak-dev/Nectar/blob/master/Nectar-regression/src/main/java/com/zinnia/nectar/regression/hadoop/primitive/mapreduce/SigmaMapper.java

 Sum Reducer -
https://github.com/zinnia-phatak-dev/Nectar/blob/master/Nectar-regression/src/main/java/com/zinnia/nectar/regression/hadoop/primitive/mapreduce/DoubleSumReducer.java

Driver or Main class -
https://github.com/zinnia-phatak-dev/Nectar/blob/master/Nectar-regression/src/main/java/com/zinnia/nectar/regression/hadoop/primitive/jobs/SigmaJob.java

By default it works for a tab-separated file, but you can easily change
that by modifying the FieldSeperator code.
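
For the variance part of the question, here is a minimal sketch of the
combiner idea discussed below, written as plain Java without the Hadoop API
so it runs on its own. It is not the Nectar code: the class and method names
(ColumnStats, Partial, mapSplit, reduce) are made up for illustration. It
assumes each split is folded into a (count, sum, sum of squares) triple per
column, which can be merged in any order, so the reducer only sees one small
record per column per split. The variance here is the population variance;
take its square root if you want the standard deviation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

public class ColumnStats {

    /** Mergeable partial aggregate for one column (hypothetical helper type). */
    static final class Partial {
        long count;
        double sum;
        double sumSq;

        void add(double x) { count++; sum += x; sumSq += x * x; }

        void merge(Partial o) { count += o.count; sum += o.sum; sumSq += o.sumSq; }

        double mean() { return sum / count; }

        /** Population variance: E[x^2] - (E[x])^2. */
        double variance() { double m = mean(); return sumSq / count - m * m; }
    }

    /** Stands in for mapper + combiner: fold one input split into per-column partials. */
    static Map<Integer, Partial> mapSplit(List<String> lines, String delimiter) {
        Map<Integer, Partial> partials = new TreeMap<>();
        // Pattern.quote treats the delimiter literally, so "|" or "." also work.
        String regex = Pattern.quote(delimiter);
        for (String line : lines) {
            String[] fields = line.split(regex);
            for (int col = 0; col < fields.length; col++) {
                double x = Double.parseDouble(fields[col].trim());
                partials.computeIfAbsent(col, k -> new Partial()).add(x);
            }
        }
        return partials;
    }

    /** Stands in for the reducer: merge the partials produced by every split. */
    static Map<Integer, Partial> reduce(List<Map<Integer, Partial>> perSplit) {
        Map<Integer, Partial> totals = new TreeMap<>();
        for (Map<Integer, Partial> split : perSplit) {
            split.forEach((col, p) ->
                totals.computeIfAbsent(col, k -> new Partial()).merge(p));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Two "splits" of a tab-separated file with two columns.
        List<Map<Integer, Partial>> perSplit = new ArrayList<>();
        perSplit.add(mapSplit(Arrays.asList("1\t10", "2\t20"), "\t"));
        perSplit.add(mapSplit(Arrays.asList("3\t30", "4\t40"), "\t"));

        reduce(perSplit).forEach((col, p) ->
            System.out.printf("column %d: sum=%.2f mean=%.2f variance=%.2f%n",
                col, p.sum, p.mean(), p.variance()));
    }
}
```

Because the heavy work happens per split, the job parallelises across
mappers even when there are only a few columns. One caveat: the
sum-of-squares formula can lose precision when the mean is large relative to
the spread, so treat this as a starting point only.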


On Tue, Apr 3, 2012 at 10:25 AM, Fang Xin <nusfangxin@gmail.com> wrote:

> Hi Rohit, thank you for your reply.
> As for the second assumption, could you kindly further enlighten me a
> bit, please?
>
> Thank you.
>
> On Tue, Apr 3, 2012 at 12:50 PM, Rohit Kelkar <rohitkelkar@gmail.com>
> wrote:
> > Your idea in first paragraph is correct. To speed up things you can
> > also explore the possibility of using a Combiner. For ex. for
> > computing the sum set the combiner to be the same class as your
> > reducer. For calculating variance write a combiner class that would
> > output (xi - mu)^2 and in the reducer code you could take the sqrt.
> >
> > Your second assumption that number of reducers = number of variables
> > is not right.
> >
> > - Rohit Kelkar
> >
> > On Tue, Apr 3, 2012 at 10:10 AM, Fang Xin <nusfangxin@gmail.com> wrote:
> >> Hi,
> >>
> >> I have a spreadsheet where each column contains values for one
> >> variable, and I need to calculate sum, variance, etc. for each column.
> >> To my understanding, mappers and reducers work on <key, value> pairs.
> >> Can anyone kindly enlighten me on how to abstract this problem?
> >>
> >> Maybe for the mapper, let it read each line, set variable name/number
> >> as "key", and corresponding value as "value".
> >> Then, when all pairs with the same "key" (i.e. they belong to the same
> >> variable) are passed to a reducer, the reducer can do the calculation
> >> and output to a file.
> >> Is this idea correct? Can anyone kindly give some comments?
> >>
> >> Besides, in this method, the number of reducers will be determined by
> >> the number of variables I have.
> >> What happens if the number of variables is small and, for each
> >> variable, the number of entries is far larger than the total number of
> >> variables? The execution time for each reducer can then be
> >> comparatively long.
> >> Is there any way to make use of more hardware resources and create more
> >> reducers to run in parallel?
> >>
> >> Best regards,
> >> Xin
>



-- 
https://github.com/zinnia-phatak-dev/Nectar
