From: java8964 <java8964@hotmail.com>
To: user@hadoop.apache.org
Subject: RE: Cumulative value using mapreduce
Date: Fri, 5 Oct 2012 10:03:27 -0400

Are you allowed to change the order of the data in the output? If you want to calculate the CR/DR indicator cumulative sum, it becomes easy if the business allows you to change the order of your data, grouped by the CR/DR indicator, in the output.

For example, you can do it very easily the way I described in my original email if you CAN change the output to the following:

Txn ID    Cr/Dr Indicator    Amount    CR Cumulative Amount    DR Cumulative Amount
1001      CR                 1000      1000                    0
1004      CR                 2000      3000                    0
1002      DR                 500       0                       500
1003      DR                 1500      0                       2000

As you can see, you have to group your output by the Cr/Dr indicator. If you want to keep the original order, then it is hard; at least I cannot think of a way on short notice.

But if you are allowed to change the order of the output, then this is a cumulative sum with grouping (in this case, group 1 for CR and group 2 for DR).

1) In the mapper, emit your data keyed by the Cr/Dr indicator, which will group the data by CR/DR. So all CR data will go to one reducer, and all DR data will go to one reducer.
2) Besides grouping the data, if you want the output in each group sorted by the Amount (for example), then you have to do a secondary sort (Google "secondary sort"). Then, for each group, the data arriving at the reducer will be sorted by amount. Otherwise, if you don't need that sorting, just skip the secondary sort.
3) In each reducer, the data arrives already grouped. The default partitioner for an MR job is the HashPartitioner. Depending on the hashCode() values of 'CR' and 'DR', these two groups could go to different reducers (assuming you are running with multiple reducers), or they could go to the same reducer. But even if they go to the same reducer, they arrive as two separate groups. So the output of your reducers will be grouped, and sorted by key as well.
4) In your reducer, for each group you will get an iterable of values; for CR, you will get all the CR records. What you need to do is iterate over the values, keep a running cumulative sum, and emit each record together with its cumulative sum.
5) In the end, your output may consist of multiple files, as each reducer generates one file. You can merge them into one file, or just leave them as they are in HDFS.
6) For best performance, if you have huge data AND you know all the possible values of the indicator, you may want to consider using your own custom Partitioner instead of the HashPartitioner. What you want is a round-robin distribution of your keys across the available reducers, instead of a distribution by hash value. Keep in mind that hash-based distribution does not balance well when the distinct count of your keys is small (see the sketch after this list).
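
To make steps 1, 3, 4 and 6 concrete, here is a minimal sketch in the new (org.apache.hadoop.mapreduce) API. It assumes a comma-separated input of txnId,txnDate,indicator,amount; the class and field names are invented for illustration only.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

public class CumulativeSumByIndicator {

  // Step 1: key every record by its CR/DR indicator so that all records
  // of one indicator reach a single reduce() call.
  public static class IndicatorMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumed input layout: txnId,txnDate,indicator,amount
      String[] fields = line.toString().split(",");
      context.write(new Text(fields[2].trim()), line);
    }
  }

  // Step 4: walk the records of one group, keep a running total, and emit
  // every record with its cumulative amount appended.
  public static class RunningSumReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text indicator, Iterable<Text> records, Context context)
        throws IOException, InterruptedException {
      long runningTotal = 0;
      for (Text record : records) {
        String[] fields = record.toString().split(",");
        runningTotal += Long.parseLong(fields[3].trim());
        context.write(new Text(record), new Text(Long.toString(runningTotal)));
      }
    }
  }

  // Step 6: with only two distinct keys, a hand-written partitioner spreads
  // them evenly over the reducers instead of trusting hashCode().
  public static class IndicatorPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text indicator, Text record, int numPartitions) {
      int slot = "CR".equals(indicator.toString()) ? 0 : 1;
      return slot % numPartitions;
    }
  }
}

You would wire these up in the driver with job.setMapperClass(...), job.setReducerClass(...) and job.setPartitionerClass(...). Note that without the secondary sort from step 2, the order of records inside each group is whatever the framework happens to deliver.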

Yong




Date: Fri, 5 Oct 2012 10:26:43 +0530
From: sarathchandra.josyam@algofusiontech.com
To: user@hadoop.apache.org
Subject: Re: Cumulative value using mapreduce

Thanks for all your responses. As suggested, I will go through the documentation once again.

But just to clarify, this is not my first map-reduce program. I've already written a map-reduce job for our product which does filtering and transformation of the financial data. This is a new requirement we've got. I have also implemented the logic for calculating the cumulative sums, but the output is not coming out as desired and I feel I'm not doing it the right way and am missing something. So I thought of getting quick help from the mailing list.

As an example, say we have records as below -
Txn ID    Txn Date    Cr/Dr Indicator    Amount
1001      9/22/2012   CR                 1000
1002      9/25/2012   DR                 500
1003      10/1/2012   DR                 1500
1004      10/4/2012   CR                 2000

When this file is processed, the logic should append the below 2 columns to the output for each record above -
CR Cumulative Amount    DR Cumulative Amount
1000                    0
1000                    500
1000                    2000
3000                    2000

Hope the problem is clear now. Please provide your suggestions on the approach to the solution.

Regards,
Sarath.

On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
I indeed didn't catch the cumulative sum part. Then I guess it begs for what-is-often-called-a-secondary-sort, if you want to compute different cumulative sums during the same job. It can be more or less easy to implement depending on which API/library/tool you are using. Ted's comments on performance are spot on.

Regards

Bertrand

On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <java8964@hotmail.com> wrote:
I did the cumulative sum in a Hive UDF, as one of the projects for my employer.

1) You need to decide the grouping elements for your cumulative sum; for example, an account, a department, etc. In the mapper, combine this information into the key you emit.
2) If you don't have any grouping requirement and just want a cumulative sum over all your data, then send all the data to one common key, so it will all go to the same reducer.
3) When you calculate the cumulative sum, does the output need a particular sort order? If so, you need to do a secondary sort so the data arrives at the reducer in the order you want (see the composite-key sketch after this list).
4) In the reducer, just do the running sum and emit a value for every original record (not one per key).
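
If the order inside a group matters (point 3), the usual trick is a composite key that the framework sorts for you. A rough sketch, assuming the group is the Cr/Dr indicator and the secondary field is the amount (the class name here is made up):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Composite key: group on 'indicator', sort within the group on 'amount'.
public class IndicatorAmountKey implements WritableComparable<IndicatorAmountKey> {
  private String indicator = "";
  private long amount;

  public void set(String indicator, long amount) {
    this.indicator = indicator;
    this.amount = amount;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(indicator);
    out.writeLong(amount);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    indicator = in.readUTF();
    amount = in.readLong();
  }

  @Override
  public int compareTo(IndicatorAmountKey other) {
    // Primary order: the group; secondary order: the amount.
    int byGroup = indicator.compareTo(other.indicator);
    return byGroup != 0 ? byGroup : Long.compare(amount, other.amount);
  }
}

You would still pair this with a partitioner and a grouping comparator that look only at the indicator, so that sorting by amount does not split one group across reduce() calls.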

I would suggest you do this in a Hive UDF, as it is much easier, if you can build a Hive schema on top of your data.
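
Before Hive had windowing functions, one way people did this was a small stateful UDF written in Java. A rough sketch (the class name is invented here, and the big caveat is that it only gives a meaningful result if the rows reach the UDF already distributed and sorted the way you want, e.g. via DISTRIBUTE BY and SORT BY):

import org.apache.hadoop.hive.ql.exec.UDF;

// Keeps a running total across the rows flowing through one Hive task.
public class CumulativeSumUDF extends UDF {
  private String currentGroup = null;
  private double runningTotal = 0.0;

  // Resets the total whenever a new group value appears, so the rows must
  // arrive clustered by group and in the desired order.
  public double evaluate(String group, double amount) {
    if (!group.equals(currentGroup)) {
      currentGroup = group;
      runningTotal = 0.0;
    }
    runningTotal += amount;
    return runningTotal;
  }
}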

Yong


From: tdunning@maprtech.com
Date: Thu, 4 Oct 2012 18:52:09 +0100
Subject: Re: Cumulative value using mapreduce
To: user@hadoop.apache.org


Bertrand is almost right.

The only difference is that the original poster asked about cumulative sum.

This can be done in the reducer exactly as Bertrand described, except for two points that make it different from word count:

a) you can't use a combiner

b) the output of the program is as large as the input so it will have different performance characteristics than aggregation programs like wordcount.

Bertrand's key recommendation to go read a book is the most important advice.

On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <dechouxb@gmail.com> wrote:
Hi,

It sounds like a
1) group information by account
2) compute sum per account (a minimal reducer sketch follows)
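
That plain per-group total is essentially wordcount with an amount in place of the constant 1. A minimal reducer sketch (names invented) which, unlike the cumulative case earlier in this thread, could also be reused as a combiner:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all amounts emitted under one account key -- the wordcount pattern
// with an amount instead of the constant 1.
public class AccountSumReducer
    extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text account, Iterable<LongWritable> amounts, Context context)
      throws IOException, InterruptedException {
    long total = 0;
    for (LongWritable amount : amounts) {
      total += amount.get();
    }
    context.write(account, new LongWritable(total));
  }
}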

If that's not the case, you should be a bit more precise about your context.

This computation looks like a small variant of wordcount. If you do not know how to do it, you should read books about Hadoop MapReduce and/or an online tutorial. Yahoo's is old but still a nice read to begin with: http://developer.yahoo.com/hadoop/tutorial/

Regards,

Bertrand


On Thu, Oct 4, 2012 at 3:58 PM, Sarath <sarathchandra.josyam@algofusiontech.com> wrote:
Hi,

I have a file which has some financial transaction data. Each transaction will have an amount and a credit/debit indicator.
I want to write a mapreduce program which computes cumulative credit & debit amounts at each record
and appends these values to the record before dumping it into the output file.

Is this possible? How can I achieve this? Where should I put the logic for computing the cumulative values?

Regards,
Sarath.



--
Bertrand Dechoux




--
Bertrand Dechoux