From: java8964 <java8964@hotmail.com>
To: user@hadoop.apache.org
Subject: RE: Cumulative value using mapreduce
Date: Fri, 5 Oct 2012 10:03:27 -0400

Are you allowed to change the order of the data in the output? If you want to calculate the CR/DR indicator cumulative sum, it becomes easy if the business allows you to change the order of your data, grouped by the CR/DR indicator, in the output.

For example, you can do it very easily the way I described in my original email if you CAN change the output to the following:

Txn ID    Cr/Dr Indicator    Amount    CR Cumulative Amount    DR Cumulative Amount
1001      CR                 1000      1000                    0
1004      CR                 2000      3000                    0
1002      DR                 500       0                       500
1003      DR                 1500      0                       2000

As you can see, you have to group your output by the Cr/Dr indicator. If you want to keep the original order, then it is hard; at least I cannot think of a way on short notice.

But if you are allowed to change the order of the output, then this is a cumulative sum with grouping (in this case, group 1 for CR and group 2 for DR).

1) In the mapper, emit your data keyed by the Cr/Dr indicator, which will group the data by CR/DR. So all CR data will go to one reducer, and all DR data will go to one reducer.
2) Besides grouping the data, if you want the output in each group sorted by the Amount (for example), then you have to do a secondary sort (Google "secondary sort"). Then, for each group, the data arriving at the reducer will be sorted by amount. Otherwise, if you don't need that sorting, just skip the secondary sort.
3) In each reducer, the data arrives already grouped. The default partitioner for an MR job is the HashPartitioner. Depending on the hashCode() values of 'CR' and 'DR', these two groups could go to different reducers (assuming you are running with multiple reducers), or they could go to the same reducer. But even if they go to the same reducer, they arrive as two separate groups. So the output of your reducers will be grouped, and sorted by key as well.
4) In your reducer, for each group you will get an iterable of values; for CR, you will get all the CR records. What you need to do is iterate over the values, keep a running cumulative sum, and emit each record together with its cumulative sum.
5) In the end, your output may consist of multiple files, as each reducer generates one file. You can merge them into one file, or just leave them as they are in HDFS.
6) For best performance, if you have huge data AND you know all the possible values of the indicator, you may want to consider using your own custom Partitioner instead of the HashPartitioner. What you want is a round-robin distribution of your keys across the available reducers, instead of a distribution by hash value. Keep in mind that hash-based distribution does not balance well when the distinct count of your keys is small (see the sketch after this list).
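
To make steps 1, 3, 4 and 6 concrete, here is a minimal sketch in the new (org.apache.hadoop.mapreduce) API. It assumes a comma-separated input of txnId,txnDate,indicator,amount; the class and field names are invented for illustration only.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

public class CumulativeSumByIndicator {

  // Step 1: key every record by its CR/DR indicator so that all records
  // of one indicator reach a single reduce() call.
  public static class IndicatorMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumed input layout: txnId,txnDate,indicator,amount
      String[] fields = line.toString().split(",");
      context.write(new Text(fields[2].trim()), line);
    }
  }

  // Step 4: walk the records of one group, keep a running total, and emit
  // every record with its cumulative amount appended.
  public static class RunningSumReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text indicator, Iterable<Text> records, Context context)
        throws IOException, InterruptedException {
      long runningTotal = 0;
      for (Text record : records) {
        String[] fields = record.toString().split(",");
        runningTotal += Long.parseLong(fields[3].trim());
        context.write(new Text(record), new Text(Long.toString(runningTotal)));
      }
    }
  }

  // Step 6: with only two distinct keys, a hand-written partitioner spreads
  // them evenly over the reducers instead of trusting hashCode().
  public static class IndicatorPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text indicator, Text record, int numPartitions) {
      int slot = "CR".equals(indicator.toString()) ? 0 : 1;
      return slot % numPartitions;
    }
  }
}

You would wire these up in the driver with job.setMapperClass(...), job.setReducerClass(...) and job.setPartitionerClass(...). Note that without the secondary sort from step 2, the order of records inside each group is whatever the framework happens to deliver.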

Yong




Date: Fri, 5 Oct 2012 10:26:43 +0530
From: sarathchandra.josyam@algofusiontech.com
To: user@hadoop.apache.org
Subject: Re: Cumulative value using mapreduce

Thanks for all your responses. As suggested, I will go through the documentation once again.

But just to clarify, this is not my first map-reduce program. I've already written a map-reduce job for our product which does filtering and transformation of the financial data. This is a new requirement we've got. I have also implemented the logic for calculating the cumulative sums, but the output is not coming out as desired and I feel I'm not doing it the right way and am missing something. So I thought of getting quick help from the mailing list.

As an example, say we have records as below -
Txn ID    Txn Date    Cr/Dr Indicator    Amount
1001      9/22/2012   CR                 1000
1002      9/25/2012   DR                 500
1003      10/1/2012   DR                 1500
1004      10/4/2012   CR                 2000

When this file is processed, the logic should append the below 2 columns to the output for each record above -
CR Cumulative Amount    DR Cumulative Amount
1000                    0
1000                    500
1000                    2000
3000                    2000

Hope the problem is clear now. Please provide your suggestions on the approach to the solution.

Regards,
Sarath.

On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
I indeed didn't catch the cumulative sum part. Then I guess it begs for what-is-often-called-a-secondary-sort, if you want to compute different cumulative sums during the same job. It can be more or less easy to implement depending on which API/library/tool you are using. Ted's comments on performance are spot on.

Regards

Bertrand

On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <java8964@hotmail.com> wrote:
I did the cumulative sum in a Hive UDF, as one of the projects for my employer.

1) You need to decide the grouping elements for your cumulative sum; for example, an account, a department, etc. In the mapper, combine this information into the key you emit.
2) If you don't have any grouping requirement and just want a cumulative sum over all your data, then send all the data to one common key, so it will all go to the same reducer.
3) When you calculate the cumulative sum, does the output need a particular sort order? If so, you need to do a secondary sort so the data arrives at the reducer in the order you want (see the composite-key sketch after this list).
4) In the reducer, just do the running sum and emit a value for every original record (not one per key).
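
If the order inside a group matters (point 3), the usual trick is a composite key that the framework sorts for you. A rough sketch, assuming the group is the Cr/Dr indicator and the secondary field is the amount (the class name here is made up):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Composite key: group on 'indicator', sort within the group on 'amount'.
public class IndicatorAmountKey implements WritableComparable<IndicatorAmountKey> {
  private String indicator = "";
  private long amount;

  public void set(String indicator, long amount) {
    this.indicator = indicator;
    this.amount = amount;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(indicator);
    out.writeLong(amount);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    indicator = in.readUTF();
    amount = in.readLong();
  }

  @Override
  public int compareTo(IndicatorAmountKey other) {
    // Primary order: the group; secondary order: the amount.
    int byGroup = indicator.compareTo(other.indicator);
    return byGroup != 0 ? byGroup : Long.compare(amount, other.amount);
  }
}

You would still pair this with a partitioner and a grouping comparator that look only at the indicator, so that sorting by amount does not split one group across reduce() calls.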

I would suggest you do this in a Hive UDF, as it is much easier, if you can build a Hive schema on top of your data.
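
Before Hive had windowing functions, one way people did this was a small stateful UDF written in Java. A rough sketch (the class name is invented here, and the big caveat is that it only gives a meaningful result if the rows reach the UDF already distributed and sorted the way you want, e.g. via DISTRIBUTE BY and SORT BY):

import org.apache.hadoop.hive.ql.exec.UDF;

// Keeps a running total across the rows flowing through one Hive task.
public class CumulativeSumUDF extends UDF {
  private String currentGroup = null;
  private double runningTotal = 0.0;

  // Resets the total whenever a new group value appears, so the rows must
  // arrive clustered by group and in the desired order.
  public double evaluate(String group, double amount) {
    if (!group.equals(currentGroup)) {
      currentGroup = group;
      runningTotal = 0.0;
    }
    runningTotal += amount;
    return runningTotal;
  }
}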

Yong


From: tdunning@maprtech.com
Date: Thu, 4 Oct 2012 18:52:09 +0100
Subject: Re: Cumulative value using mapreduce
To: user@hadoop.apache.org


Bertrand is almost right.

The only difference is that the original poster asked about cumulative sum.

This can be done in the reducer exactly as Bertrand described, except for two points that make it different from word count:

a) you can't use a combiner

b) the output of the program is as large as the input so it will have different performance characteristics than aggregation programs like wordcount.

Bertrand's key recommendation to go read a book is the most important advice.

On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <dechouxb@gmail.com> wrote:
Hi,

It sounds like a
1) group information by account
2) compute sum per account (a minimal reducer sketch follows)
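
That plain per-group total is essentially wordcount with an amount in place of the constant 1. A minimal reducer sketch (names invented) which, unlike the cumulative case earlier in this thread, could also be reused as a combiner:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all amounts emitted under one account key -- the wordcount pattern
// with an amount instead of the constant 1.
public class AccountSumReducer
    extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text account, Iterable<LongWritable> amounts, Context context)
      throws IOException, InterruptedException {
    long total = 0;
    for (LongWritable amount : amounts) {
      total += amount.get();
    }
    context.write(account, new LongWritable(total));
  }
}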

If that's not the case, you should be a bit more precise about your context.

This computation looks like a small variant of wordcount. If you do not know how to do it, you should read books about Hadoop MapReduce and/or an online tutorial. Yahoo's is old but still a nice read to begin with: http://developer.yahoo.com/hadoop/tutorial/

Regards,

Bertrand


On Thu, Oct 4, 2012 at 3:58 PM, Sarath <sarathchandra.josyam@algofusiontech.com> wrote:
Hi,

I have a file which has some financial transaction data. Each transaction will have an amount and a credit/debit indicator.
I want to write a mapreduce program which computes cumulative credit & debit amounts at each record
and appends these values to the record before dumping it into the output file.

Is this possible? How can I achieve this? Where should I put the logic for computing the cumulative values?

Regards,
Sarath.



--
Bertrand Dechoux




--
Bertrand Dechoux