hadoop-user mailing list archives

From Jane Wayne <jane.wayne2...@gmail.com>
Subject Re: Cumulative value using mapreduce
Date Fri, 05 Oct 2012 15:31:47 GMT
i'm reading the other posts. i had assumed you had more than one reducer.

if you just have 1 reducer, then no matter what, every key-value pair goes
there. so, in that case, i agree with java8964: you emit all records with
one key to that one reducer. make sure you apply secondary sorting (that
means you will have to come up with a composite key). when the data comes
into the reducer, just keep running sums and emit each time.
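
to make this concrete, here's a rough, untested sketch of such a reducer in
java. it assumes the newer hadoop mapreduce api, a single constant key, and
that a composite key plus grouping comparator (not shown) already handle the
secondary sort by tx id; the class name, csv layout, and field positions are
all made up for illustration:

  import java.io.IOException;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class RunningSumReducer
      extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      long cr = 0;
      long dr = 0;
      // values arrive ordered by tx id thanks to the secondary sort;
      // each value is an original CSV line: "txId,txDate,crDr,amount"
      for (Text v : values) {
        String[] f = v.toString().split(",");
        long amount = Long.parseLong(f[3]);
        if ("CR".equalsIgnoreCase(f[2])) {
          cr += amount;
        } else {
          dr += amount;
        }
        // one output line per input record, with both running totals appended
        ctx.write(new Text(v + "," + cr + "," + dr), NullWritable.get());
      }
    }
  }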

On Fri, Oct 5, 2012 at 11:21 AM, Jane Wayne <jane.wayne2978@gmail.com> wrote:

> there's probably a million ways to do it, but it seems like it can be
> done, per your question. off the top of my head, you'd probably want to do
> the cumulative sum in the reducer. if you're savvy, maybe even make the
> reducer reusable as a combiner (looks like this problem might have an
> associative and commutative reducer).
>
> the difficulty with this problem is that for n input records, you will
> have n output records (looking at your example). furthermore, each n-th
> output record requires information from all the previous (n-1) records. so,
> if you have 1 billion input records, it's looking like you may have to move
> a lot of intermediary key-value pairs to your reducer.
>
> here's a suggestion and please critique, perhaps i may learn something.
> let's take a naive approach. i assume you have this data in a text file
> with CSV. i assume the Tx Ids are sequential, and you know what the
> start/stop Tx Id is. the mapper/reducer "pseudocode" looks like the
> following.
>
> map(byteOffset, text) {
>  data = parse(text)
>  for i=data.txId to stopTxId
>   emit(i, data)
> }
>
> reduce(txId, datas) {
>  cr = 0
>  dr = 0
>
>  while datas.hasMoreItems
>  data = datas.nextItem //iterate over the values for this txId
>   if "dr" == data.crDrIndicator
>    dr += data.amount
>   else
>    cr += data.amount
>
>  emit(txId, {cr, dr})
> }
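>
> for concreteness, the map step might look roughly like this in java. this
> is an untested sketch; it assumes the stop Tx Id is passed in through the
> job configuration under a made-up property name, and that the input is the
> CSV file assumed above:
>
>  import java.io.IOException;
>  import org.apache.hadoop.io.LongWritable;
>  import org.apache.hadoop.io.Text;
>  import org.apache.hadoop.mapreduce.Mapper;
>
>  public class FanOutMapper
>      extends Mapper<LongWritable, Text, LongWritable, Text> {
>    private long stopTxId;
>
>    @Override
>    protected void setup(Context ctx) {
>      // hypothetical config property carrying the known stop Tx Id
>      stopTxId = ctx.getConfiguration().getLong("stop.tx.id", 0L);
>    }
>
>    @Override
>    protected void map(LongWritable offset, Text line, Context ctx)
>        throws IOException, InterruptedException {
>      long txId = Long.parseLong(line.toString().split(",")[0]);
>      // replicate this record to its own tx id and every later one, so
>      // each reduce key sees every record up to and including itself
>      for (long i = txId; i <= stopTxId; i++) {
>        ctx.write(new LongWritable(i), line);
>      }
>    }
>  }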
>
> what's not desirable about this pseudocode?
> 1. lots of intermediary key-value pairs
> 2. no combiner
> 3. requires knowledge of background information and certain assumptions
> 4. will definitely create "stragglers" (some mappers/reducers will take
> longer to complete than others)
> 5. overflow issues with the cumulative sum?
>
> i thought about the secondary sorting idea, but i'm still not sure how
> that can work. what would you sort on?
>
> one of the things i learned in programming 101: get the algorithm to work
> first, then optimize later. hope this helps. please feel free to critique.
> would love to learn some more.
>
> On Fri, Oct 5, 2012 at 12:56 AM, Sarath <
> sarathchandra.josyam@algofusiontech.com> wrote:
>
>>  Thanks for all your responses. As suggested, I will go through the
>> documentation once again.
>>
>> But just to clarify, this is not my first map-reduce program. I've
>> already written a map-reduce for our product which does filtering and
>> transformation of the financial data. This is a new requirement we've got.
>> I have also implemented the logic for calculating the cumulative sums. But
>> the output is not coming out as desired, and I feel I'm not doing it the
>> right way and am missing something. So I thought of getting some quick
>> help from the mailing list.
>>
>> As an example, say we have records as below -
>>   Txn ID | Txn Date  | Cr/Dr Indicator | Amount
>>   1001   | 9/22/2012 | CR              | 1000
>>   1002   | 9/25/2012 | DR              | 500
>>   1003   | 10/1/2012 | DR              | 1500
>>   1004   | 10/4/2012 | CR              | 2000
>>
>> When this file is processed, the logic should append the below 2 columns
>> to the output for each record above -
>>   CR Cumulative Amount | DR Cumulative Amount
>>   1000                 | 0
>>   1000                 | 500
>>   1000                 | 2000
>>   3000                 | 2000
>>
>> Hope the problem is clear now. Please provide your suggestions on the
>> approach to the solution.
>>
>> Regards,
>> Sarath.
>>
>>
>> On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
>>
>> I indeed didn't catch the cumulative sum part. Then I guess it begs for
>> what-is-often-called-a-secondary-sort, if you want to compute different
>> cumulative sums during the same job. It can be more or less easy to
>> implement depending on which API/library/tool you are using. Ted's comments
>> on performance are spot on.
>>
>>  Regards
>>
>>  Bertrand
>>
>> On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <java8964@hotmail.com> wrote:
>>
>>>  I did a cumulative sum in a HIVE UDF, as one of the projects for my
>>> employer.
>>>
>>> 1) You need to decide the grouping elements for your cumulative sum. For
>>> example, an account, a department, etc. In the mapper, combine this
>>> information into your emitted key.
>>> 2) If you don't have any grouping requirement and just want a
>>> cumulative sum for all your data, then send all the data to one common key,
>>> so it will all go to the same reducer.
>>> 3) When you calculate the cumulative sum, does the output need to have a
>>> sorting order? If so, you need to do a secondary sort, so the data will be
>>> sorted in the order you want in the reducer.
>>> 4) In the reducer, just do the sum, emitting one value per original record
>>> (not per key).
>>>
>>>  I would suggest you do this in a HIVE UDF, as it is much easier, if
>>> you can build a HIVE schema on top of your data.
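>>>
>>> For step 3, a rough sketch of what the composite key for the secondary
>>> sort could look like (a hypothetical class: group on account, sort within
>>> the group by transaction id; the matching partitioner and grouping
>>> comparator are not shown):
>>>
>>>  import java.io.DataInput;
>>>  import java.io.DataOutput;
>>>  import java.io.IOException;
>>>  import org.apache.hadoop.io.LongWritable;
>>>  import org.apache.hadoop.io.Text;
>>>  import org.apache.hadoop.io.WritableComparable;
>>>
>>>  public class AccountTxnKey implements WritableComparable<AccountTxnKey> {
>>>    private Text account = new Text();               // grouping element
>>>    private LongWritable txnId = new LongWritable(); // in-group sort order
>>>
>>>    public void set(String acct, long id) {
>>>      account.set(acct);
>>>      txnId.set(id);
>>>    }
>>>
>>>    @Override
>>>    public int compareTo(AccountTxnKey o) {
>>>      // sort by account first, then by transaction id within the account
>>>      int c = account.compareTo(o.account);
>>>      return c != 0 ? c : txnId.compareTo(o.txnId);
>>>    }
>>>
>>>    @Override
>>>    public void write(DataOutput out) throws IOException {
>>>      account.write(out);
>>>      txnId.write(out);
>>>    }
>>>
>>>    @Override
>>>    public void readFields(DataInput in) throws IOException {
>>>      account.readFields(in);
>>>      txnId.readFields(in);
>>>    }
>>>  }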
>>>
>>>  Yong
>>>
>>>  ------------------------------
>>> From: tdunning@maprtech.com
>>> Date: Thu, 4 Oct 2012 18:52:09 +0100
>>> Subject: Re: Cumulative value using mapreduce
>>> To: user@hadoop.apache.org
>>>
>>>
>>> Bertrand is almost right.
>>>
>>>  The only difference is that the original poster asked about cumulative
>>> sum.
>>>
>>>  This can be done in the reducer exactly as Bertrand described, except for
>>> two points that make it different from word count:
>>>
>>>  a) you can't use a combiner
>>>
>>>  b) the output of the program is as large as the input so it will have
>>> different performance characteristics than aggregation programs like
>>> wordcount.
>>>
>>>  Bertrand's key recommendation to go read a book is the most important
>>> advice.
>>>
>>> On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <dechouxb@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>>  It sounds like a
>>> 1) group information by account
>>> 2) compute sum per account
>>>
>>>  If that's not the case, you should be a bit more precise about your context.
>>>
>>>  This computation looks like a small variant of wordcount. If you do not
>>> know how to do it, you should read books about Hadoop MapReduce and/or an
>>> online tutorial. Yahoo's is old but still a nice read to begin with:
>>> http://developer.yahoo.com/hadoop/tutorial/
>>>
>>>  Regards,
>>>
>>>  Bertrand
>>>
>>>
>>> On Thu, Oct 4, 2012 at 3:58 PM, Sarath <
>>> sarathchandra.josyam@algofusiontech.com> wrote:
>>>
>>> Hi,
>>>
>>> I have a file which has some financial transaction data. Each
>>> transaction will have an amount and a credit/debit indicator.
>>> I want to write a mapreduce program which computes cumulative credit &
>>> debit amounts at each record
>>> and appends these values to the record before dumping it into the output
>>> file.
>>>
>>> Is this possible? How can I achieve this? Where should I put the logic
>>> of computing the cumulative values?
>>>
>>> Regards,
>>> Sarath.
>>>
>>>
>>>
>>>
>>>   --
>>> Bertrand Dechoux
>>>
>>>
>>>
>>
>>
>>  --
>> Bertrand Dechoux
>>
>>
>
