hadoop-user mailing list archives

From Jane Wayne <jane.wayne2...@gmail.com>
Subject Re: Cumulative value using mapreduce
Date Fri, 05 Oct 2012 15:21:50 GMT
there are probably a million ways to do it, but per your question, yes, it
seems like it can be done. off the top of my head, you'd probably want to do
the cumulative sum in the reducer. if you're savvy, maybe even make the
reducer reusable as a combiner (it looks like this problem might have an
associative and commutative reducer).

the difficulty with this problem is that for n input records, you will have
n output records (looking at your example). furthermore, the n-th output
record requires information from all of the previous (n-1) records. so, if
you have 1 billion input records, you may have to move a lot of intermediate
key-value pairs to your reducer.

here's a suggestion; please critique it, perhaps i may learn something.
let's take a naive approach. i assume you have this data in a CSV text file,
that the Tx Ids are sequential, and that you know what the start/stop Tx Ids
are. the mapper/reducer "pseudocode" looks like the following.

map(byteOffset, text) {
 data = parse(text) //parse the CSV record
 for i = data.txId to stopTxId //replicate to this and every later tx id
  emit(i, data)
}

reduce(txId, datas) {
 cr = 0
 dr = 0

 //sum every record at or before this tx id
 while datas.hasMoreItems
  data = datas.nextItem //iterate
  if "dr" == data.crDrIndicator
   dr += data.amount
  else
   cr += data.amount

 emit(txId, {cr, dr})
}
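
to make this concrete, here's a rough java sketch of the same idea. two
assumptions that are mine, not yours: CSV lines shaped like
txId,txDate,crDrIndicator,amount, and a made-up job property "stop.tx.id"
carrying the last Tx Id.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NaiveCumulative {

 //replicates each record to its own tx id and every later tx id
 public static class ReplicateMapper
     extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable byteOffset, Text line, Context context)
      throws IOException, InterruptedException {
   long stopTxId = context.getConfiguration().getLong("stop.tx.id", 0L);
   String[] fields = line.toString().split(",");
   long txId = Long.parseLong(fields[0]);
   for (long i = txId; i <= stopTxId; i++) {
    context.write(new LongWritable(i), line);
   }
  }
 }

 //sums credits and debits over every record at or before this tx id
 public static class SumReducer
     extends Reducer<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void reduce(LongWritable txId, Iterable<Text> records,
      Context context) throws IOException, InterruptedException {
   long cr = 0;
   long dr = 0;
   for (Text record : records) {
    String[] fields = record.toString().split(",");
    if ("DR".equalsIgnoreCase(fields[2])) {
     dr += Long.parseLong(fields[3]);
    } else {
     cr += Long.parseLong(fields[3]);
    }
   }
   context.write(txId, new Text(cr + "," + dr));
  }
 }
}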

what's not desirable about this pseudocode?
1. lots of intermediate key-value pairs
2. no combiner
3. requires background knowledge and certain assumptions
4. will definitely create "stragglers" (some mappers/reducers will take
longer to complete than others)
5. possible overflow issues with the cumulative sum?

i thought about the secondary sorting idea, but i'm still not sure how that
can work. what would you sort on?
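
one guess at an answer, sketched below: with no grouping at all, key on the
Tx Id itself and force a single reducer (job.setNumReduceTasks(1)). hadoop
sorts the reducer's input keys, so that one reducer sees the transactions in
Tx Id order and can carry the running totals in instance fields across
reduce() calls. the mapper emitting (txId, whole record) and the field
layout are my assumptions.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

//only correct with exactly one reducer, since the totals live inside
//this single task
public class RunningTotalReducer
    extends Reducer<LongWritable, Text, Text, Text> {
 private long cr = 0; //cumulative credits seen so far
 private long dr = 0; //cumulative debits seen so far

 @Override
 protected void reduce(LongWritable txId, Iterable<Text> records,
     Context context) throws IOException, InterruptedException {
  for (Text record : records) {
   String[] fields = record.toString().split(","); //txId,date,crDr,amount
   if ("DR".equalsIgnoreCase(fields[2])) {
    dr += Long.parseLong(fields[3]);
   } else {
    cr += Long.parseLong(fields[3]);
   }
   //append the two cumulative columns to the original record
   context.write(record, new Text(cr + "," + dr));
  }
 }
}

and if you want separate cumulative sums per group (say per account), then i
guess the sort key becomes (account, txId) as a composite key, which is the
usual secondary sort setup.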

one of the things i learned in programming 101: get the algorithm to work
first, then optimize later. hope this helps. please feel free to critique;
i would love to learn some more.

On Fri, Oct 5, 2012 at 12:56 AM, Sarath <
sarathchandra.josyam@algofusiontech.com> wrote:

>  Thanks for all your responses. As suggested, I will go through the
> documentation once again.
>
> But just to clarify, this is not my first map-reduce program. I've already
> written a map-reduce job for our product which does filtering and
> transformation of financial data. This is a new requirement we've got.
> I have also implemented the logic for calculating the cumulative sums, but
> the output is not coming out as desired. I feel I'm not doing it the right
> way and am missing something, so I thought of taking quick help from the
> mailing list.
>
> As an example, say we have records as below -
>
>   Txn ID | Txn Date  | Cr/Dr Indicator | Amount
>   -------+-----------+-----------------+-------
>   1001   | 9/22/2012 | CR              | 1000
>   1002   | 9/25/2012 | DR              | 500
>   1003   | 10/1/2012 | DR              | 1500
>   1004   | 10/4/2012 | CR              | 2000
>
> When this file is processed, the logic should append the below 2 columns
> to the output for each record above -
>
>   CR Cumulative Amount | DR Cumulative Amount
>   ---------------------+---------------------
>   1000                 | 0
>   1000                 | 500
>   1000                 | 2000
>   3000                 | 2000
>
> Hope the problem is clear now. Please provide your suggestions on the
> approach to the solution.
>
> Regards,
> Sarath.
>
>
> On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
>
> I indeed didn't catch the cumulative sum part. Then I guess it begs for
> what-is-often-called-a-secondary-sort, if you want to compute different
> cumulative sums during the same job. It can be more or less easy to
> implement depending on which API/library/tool you are using. Ted's comments
> on performance are spot on.
>
>  Regards
>
>  Bertrand
>
> On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <java8964@hotmail.com> wrote:
>
>>  I did the cumulative sum in a HIVE UDF, as one of the projects for my
>> employer.
>>
>>  1) You need to decide the grouping elements for your cumulative sum. For
>> example, an account, a department etc. In the mapper, combine this
>> information into your emit key.
>> 2) If you don't have any grouping requirement and you just want a
>> cumulative sum over all your data, then send all the data to one common
>> key, so it will all go to the same reducer.
>> 3) When you calculate the cumulative sum, does the output need to have a
>> sorting order? If so, you need to do a secondary sort, so the data will be
>> sorted in the order you want in the reducer (see the key sketch after this
>> list).
>> 4) In the reducer, just do the sum, and emit a value for every original
>> record (not per key).
>>
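>>  For 3), the composite key could look roughly like this. A sketch with
>> made-up names; you still need a Partitioner and a grouping comparator
>> that look at the group part only, so all records of one group meet in
>> one reduce call, sorted by txId:
>>
>> import java.io.DataInput;
>> import java.io.DataOutput;
>> import java.io.IOException;
>> import org.apache.hadoop.io.WritableComparable;
>>
>> // partition and group on "group" only; txId just controls the sort
>> // order inside each reducer group
>> public class GroupTxKey implements WritableComparable<GroupTxKey> {
>>   public String group = "";
>>   public long txId;
>>
>>   public void write(DataOutput out) throws IOException {
>>     out.writeUTF(group);
>>     out.writeLong(txId);
>>   }
>>
>>   public void readFields(DataInput in) throws IOException {
>>     group = in.readUTF();
>>     txId = in.readLong();
>>   }
>>
>>   public int compareTo(GroupTxKey other) {
>>     int c = group.compareTo(other.group);
>>     return c != 0 ? c : Long.compare(txId, other.txId);
>>   }
>> }
>>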
>>  I would suggest you do this in a HIVE UDF, as it is much easier, if
>> you can build a HIVE schema on top of your data.
>>
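>>  The UDF itself can be tiny, something like this sketch. It is stateful,
>> so the running total is only right when the rows reach it already in the
>> order you want, inside a single task:
>>
>> import org.apache.hadoop.hive.ql.exec.UDF;
>> import org.apache.hadoop.hive.ql.udf.UDFType;
>>
>> @UDFType(deterministic = false, stateful = true)
>> public class CumulativeSum extends UDF {
>>   private double total = 0.0; // running total kept across rows
>>
>>   public double evaluate(double amount) {
>>     total += amount;
>>     return total;
>>   }
>> }
>>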
>>  Yong
>>
>>  ------------------------------
>> From: tdunning@maprtech.com
>> Date: Thu, 4 Oct 2012 18:52:09 +0100
>> Subject: Re: Cumulative value using mapreduce
>> To: user@hadoop.apache.org
>>
>>
>> Bertrand is almost right.
>>
>>  The only difference is that the original poster asked about cumulative
>> sum.
>>
>>  This can be done in the reducer exactly as Bertrand described, except for
>> two points that make it different from word count:
>>
>>  a) you can't use a combiner
>>
>>  b) the output of the program is as large as the input, so it will have
>> different performance characteristics from aggregation programs like
>> wordcount.
>>
>>  Bertrand's key recommendation to go read a book is the most important
>> advice.
>>
>> On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <dechouxb@gmail.com> wrote:
>>
>> Hi,
>>
>>  It sounds like a
>> 1) group information by account
>> 2) compute the sum per account
>>
>>  If that's not the case, you should be a bit more precise about your context.
>>
>>  This computation looks like a small variant of wordcount. If you do not
>> know how to do it, you should read books about Hadoop MapReduce and/or an
>> online tutorial. Yahoo's is old but still a nice read to begin with:
>> http://developer.yahoo.com/hadoop/tutorial/
>>
>>  Regards,
>>
>>  Bertrand
>>
>>
>> On Thu, Oct 4, 2012 at 3:58 PM, Sarath <
>> sarathchandra.josyam@algofusiontech.com> wrote:
>>
>> Hi,
>>
>> I have a file which has some financial transaction data. Each transaction
>> will have an amount and a credit/debit indicator.
>> I want to write a mapreduce program which computes the cumulative credit &
>> debit amounts at each record
>> and appends these values to the record before dumping it into the output file.
>>
>> Is this possible? How can I achieve this? Where should I put the logic of
>> computing the cumulative values?
>>
>> Regards,
>> Sarath.
>>
>>
>>
>>
>>   --
>> Bertrand Dechoux
>>
>>
>>
>
>
>  --
> Bertrand Dechoux
>
>
