hadoop-user mailing list archives

From Bertrand Dechoux <decho...@gmail.com>
Subject Re: Cumulative value using mapreduce
Date Thu, 04 Oct 2012 21:21:21 GMT
I indeed didn't catch the cumulative sum part. Then I guess it calls for
what is often called a secondary sort, if you want to compute different
cumulative sums during the same job. It can be more or less easy to
implement depending on which API/library/tool you are using. Ted's comments
on performance are spot on.
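To make the secondary-sort idea concrete, here is a minimal Python sketch (not from the thread; the record layout and field names are assumptions). Python's sort stands in for the shuffle phase: in a real Hadoop job you would use a composite key, a partitioner on the group part, and a grouping comparator.

```python
# Hypothetical sketch: per-group cumulative sums via a composite
# (group, sequence) key. Sorting by that key simulates what the
# shuffle/secondary sort would do in an actual MapReduce job.

records = [
    ("acct1", 3, 10.0),
    ("acct2", 1, 5.0),
    ("acct1", 1, 2.0),
    ("acct1", 2, 4.0),
    ("acct2", 2, 7.0),
]  # (group, sequence_number, amount)

# "Shuffle": sort by the composite key (group, sequence).
records.sort(key=lambda r: (r[0], r[1]))

# "Reduce": one running total per group, emitted once per record.
def cumulative_sums(sorted_records):
    totals = {}
    for group, seq, amount in sorted_records:
        totals[group] = totals.get(group, 0.0) + amount
        yield (group, seq, amount, totals[group])

for row in cumulative_sums(records):
    print(row)
```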

Regards

Bertrand

On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <java8964@hotmail.com> wrote:

>  I did the cumulative sum in a HIVE UDF, as one of the projects for my
> employer.
>
> 1) You need to decide the grouping elements for your cumulative sum. For
> example, an account, a department, etc. In the mapper, combine this
> information into your emitted key.
> 2) If you don't have any grouping requirement and just want a cumulative
> sum over all your data, then send all the data to one common key, so it
> will all go to the same reducer.
> 3) When you calculate the cumulative sum, does the output need to be in a
> sorted order? If so, you need a secondary sort, so the data arrives at the
> reducer in the order you want.
> 4) In the reducer, just do the sum and emit a value per original record
> (not per key).
>
> I suggest you do this in a HIVE UDF, as it is much easier, if you can
> build a HIVE schema on top of your data.
>
> Yong
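The four steps above can be sketched as a Hadoop Streaming-style mapper/reducer pair in Python (a minimal sketch, not the poster's code; the CSV input layout of `txn_id,account,amount` is an assumption). The mapper emits the grouping element as the key; the reducer relies on the framework delivering keys in sorted order, keeps a running total per group, and emits one line per input record rather than per key.

```python
# Hypothetical Hadoop Streaming sketch of Yong's steps 1-4.
# Input record layout (txn_id,account,amount) is assumed, not from the thread.

def mapper(lines):
    # Steps 1/2: emit the grouping element (account) as the key.
    for line in lines:
        txn_id, account, amount = line.rstrip("\n").split(",")
        yield f"{account}\t{amount}"

def reducer(lines):
    # Step 4: running total per group, one output line per record.
    current, total = None, 0.0
    for line in lines:  # streaming input arrives sorted by key
        account, amount = line.rstrip("\n").split("\t")
        if account != current:
            current, total = account, 0.0  # new group: reset the sum
        total += float(amount)
        yield f"{account}\t{amount}\t{total}"

# Local demo: sorted() stands in for the shuffle between map and reduce.
sample = ["t1,acct1,2", "t2,acct1,3", "t3,acct2,5"]
for line in reducer(sorted(mapper(sample))):
    print(line)
```

In a real job the two functions would be separate scripts passed to the streaming jar via `-mapper` and `-reducer`.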
>
> ------------------------------
> From: tdunning@maprtech.com
> Date: Thu, 4 Oct 2012 18:52:09 +0100
> Subject: Re: Cumulative value using mapreduce
> To: user@hadoop.apache.org
>
>
> Bertrand is almost right.
>
> The only difference is that the original poster asked about cumulative sum.
>
> This can be done in the reducer exactly as Bertrand described, except for
> two points that make it different from word count:
>
> a) you can't use a combiner
>
> b) the output of the program is as large as the input so it will have
> different performance characteristics than aggregation programs like
> wordcount.
>
> Bertrand's key recommendation to go read a book is the most important
> advice.
>
> On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <dechouxb@gmail.com> wrote:
>
> Hi,
>
> It sounds like:
> 1) group information by account
> 2) compute the sum per account
>
> If that is not the case, you should be more precise about your context.
>
> This computation looks like a small variant of wordcount. If you do not
> know how to do it, you should read books about Hadoop MapReduce and/or an
> online tutorial. Yahoo's is old but still a nice read to begin with:
> http://developer.yahoo.com/hadoop/tutorial/
>
> Regards,
>
> Bertrand
>
>
> On Thu, Oct 4, 2012 at 3:58 PM, Sarath <
> sarathchandra.josyam@algofusiontech.com> wrote:
>
> Hi,
>
> I have a file which has some financial transaction data. Each transaction
> has an amount and a credit/debit indicator.
> I want to write a mapreduce program which computes the cumulative credit &
> debit amounts at each record
> and appends these values to the record before dumping it into the output file.
>
> Is this possible? How can I achieve this? Where should I put the logic for
> computing the cumulative values?
>
> Regards,
> Sarath.
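For the specific question asked here, the reduce-side logic can be sketched as two running totals appended to every record. This is a minimal sketch, not code from the thread; the record layout (a transaction id, an amount, and a "C"/"D" indicator) is an assumption, and the records are assumed to already be in the desired order.

```python
# Hypothetical sketch: cumulative credit & debit totals appended per record.
# Record layout (txn_id, amount, "C"/"D" flag) is assumed, not from the thread.

def append_cumulative(records):
    credit_total = debit_total = 0.0
    for txn_id, amount, indicator in records:  # assumed already sorted
        if indicator == "C":
            credit_total += amount
        else:
            debit_total += amount
        # Emit the original record plus both cumulative values.
        yield (txn_id, amount, indicator, credit_total, debit_total)

rows = [("t1", 100.0, "C"), ("t2", 40.0, "D"), ("t3", 60.0, "C")]
for r in append_cumulative(rows):
    print(r)
```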
>
>
>
>
> --
> Bertrand Dechoux
>
>
>


-- 
Bertrand Dechoux
