hadoop-mapreduce-user mailing list archives

From Stephen Watt <sw...@us.ibm.com>
Subject Re: How do I sum by Key in the Reduce Phase AND keep the initial value
Date Tue, 12 Jan 2010 20:54:03 GMT
Thanks for responding, Amogh.

I'm using Hadoop 0.20.1 and see from the JIRA you mentioned that it's
resolved in 0.21. Bummer... I've thought about the same thing you
mentioned; however, it's my understanding that keeping those values or
records in memory is dangerous, as you can run out of memory depending on
how many values you have (and I have a big dataset). Really, what I am
trying to understand here is the MapReduce pattern for solving this type
of problem. Until we have a reduce values iterator we can move through
more than once, I believe the pattern would be:

1) Have the first job simply store the key and the sum value.
2) Using the same keys, have the second job append the sum from the first
job to each record in the reducer. The reducer would FIRST go to HDFS and
look up the sum for the key from the first job's output, and then iterate
through the values for that key in the second job, appending the sum to
each record.
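
The two-job pattern above can be sketched in plain Java, without any Hadoop classes. This is a minimal simulation under stated assumptions, not a real job: the class `TwoJobSumSketch` and its methods are hypothetical names, an in-memory map stands in for the first job's output on HDFS, and each list traversal stands in for one streaming pass over the data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of the two-job pattern; no Hadoop classes are used. */
public class TwoJobSumSketch {

    /** Job 1: one streaming pass that stores only the sum per key. */
    public static Map<String, Integer> sumByKey(List<String[]> records) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (String[] rec : records) {
            sums.merge(rec[0], Integer.parseInt(rec[1]), Integer::sum);
        }
        return sums;
    }

    /** Job 2: a second pass that looks up the sum and appends it to each record. */
    public static List<String> appendSums(List<String[]> records, Map<String, Integer> sums) {
        List<String> out = new ArrayList<>();
        for (String[] rec : records) {
            // The map lookup stands in for reading the first job's output from HDFS.
            out.add(rec[0] + "\t" + rec[1] + " " + sums.get(rec[0]));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
            new String[] {"A", "11"}, new String[] {"A", "9"},
            new String[] {"B", "2"}, new String[] {"B", "3"});
        Map<String, Integer> sums = sumByKey(records);
        appendSums(records, sums).forEach(System.out::println);
    }
}
```

Only one running total per key is held at a time, so memory grows with the number of distinct keys rather than with the number of values per key. The cost is a second full pass over the input.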

Kind regards
Steve Watt

From: Amogh Vasekar <amogh@yahoo-inc.com>
To: "mapreduce-user@hadoop.apache.org" <mapreduce-user@hadoop.apache.org>
Date: 01/12/2010 02:01 PM
Subject: Re: How do I sum by Key in the Reduce Phase AND keep the initial value

I ran into a very similar situation quite some time back and came across
this: http://issues.apache.org/jira/browse/HADOOP-475
When I spoke to a few Hadoop folks, they said complete cloning was not a
straightforward option for some optimization reasons.
I tried a few things: running this in a single MR job that emits <k,v>
from the mapper one more time with some tagging info (this bumped up the
shuffle and sort (S&S) phase by quite a lot); running a map-only successor
job; etc. But keeping records in memory and writing them to disk after a
certain threshold worked pretty well for me (all this on Hadoop 0.17.2).
Anyway, they seem to have resolved it in the next Hadoop release.
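
The "keep records in memory and write to disk after a threshold" approach might look roughly like the plain-Java sketch below. It is an illustration under assumptions, not the code used on Hadoop 0.17.2: the class `SpillingSumSketch`, the `threshold` parameter, and the temp-file spill format are all hypothetical, and the returned list stands in for the output collector (a real reducer would write each record out instead of collecting them).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Plain-Java sketch: buffer values in memory, spill to a temp file past a threshold. */
public class SpillingSumSketch {

    /** Reads the values once, then replays them with the sum appended. */
    public static List<String> reduce(String key, Iterator<Integer> values, int threshold) {
        try {
            return reduceWithSpill(key, values, threshold);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static List<String> reduceWithSpill(String key, Iterator<Integer> values,
                                                int threshold) throws IOException {
        List<Integer> buffer = new ArrayList<>();
        Path spill = Files.createTempFile("reduce-spill", ".txt");
        long sum = 0;
        try {
            // Single pass: accumulate the sum and keep a replayable copy of each value.
            try (BufferedWriter writer = Files.newBufferedWriter(spill, StandardCharsets.UTF_8)) {
                while (values.hasNext()) {
                    int v = values.next();
                    sum += v;
                    buffer.add(v);
                    if (buffer.size() >= threshold) {
                        // Threshold reached: move the buffered values to disk.
                        for (int b : buffer) {
                            writer.write(Integer.toString(b));
                            writer.newLine();
                        }
                        buffer.clear();
                    }
                }
            }
            List<String> out = new ArrayList<>(); // stands in for the output collector
            // Replay spilled values first (they arrived first), then the in-memory tail.
            try (BufferedReader reader = Files.newBufferedReader(spill, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    out.add(key + "\t" + line + " " + sum);
                }
            }
            for (int b : buffer) {
                out.add(key + "\t" + b + " " + sum);
            }
            return out;
        } finally {
            Files.deleteIfExists(spill);
        }
    }
}
```

The buffer never holds more than `threshold` values, so memory use stays bounded even for a key with a very large number of values, at the price of extra local disk I/O for the spilled portion.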


On 1/12/10 10:29 PM, "Stephen Watt" <swatt@us.ibm.com> wrote:

The key/value pairs coming into my reducer are as follows:

KEY (Text)    VALUE (IntWritable)
A             11
A             9
B             2
B             3

I want my reducer to sum the Values for each input key and then output the 
key with a Text Value containing the original value and the sum. 

KEY (Text)    VALUE (Text)
A             11    20
A             9     20
B             2     5
B             3     5

Here is the issue: in the reducer, I am iterating through the values for
each key using values.iterator() and storing the total in a variable.
Then I am TRYING to iterate through the values again, except this time
writing the new value (A, new Text("11 20")) to the output collector to
create the value structure displayed in the example above. This fails
because it appears I can only iterate through the values for each key
ONCE. I know this because additional attempts to get new iterators from
the context, or from the Iterable that's passed into the reducer, always
return false on the initial hasNext().
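
The symptom described above can be reproduced outside Hadoop with a small plain-Java sketch. It assumes, as an illustration only, that the reducer's Iterable is backed by a single forward-only stream, so every call to iterator() hands back the same, already-consumed iterator. `OneShotIterableSketch` and `oneShot` are hypothetical names, not Hadoop APIs.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Plain-Java illustration of an Iterable that can only be walked once. */
public class OneShotIterableSketch {

    /** Every iterator() call returns the same underlying forward-only stream. */
    public static <T> Iterable<T> oneShot(Iterator<T> stream) {
        return () -> stream;
    }

    public static void main(String[] args) {
        Iterable<Integer> values = oneShot(Arrays.asList(11, 9).iterator());

        int sum = 0;
        for (int v : values) {   // first pass consumes the stream
            sum += v;
        }
        System.out.println(sum);                          // prints 20

        // A "new" iterator is the same exhausted stream, so there is nothing left.
        System.out.println(values.iterator().hasNext());  // prints false
    }
}
```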

I have to iterate through the values twice because the first time I have
to sum them, and the second time I need to write the initial value (11)
and the sum (20), as I need both values as part of a calculation in the
next job. Any ideas on how to do this?

Kind regards
Steve Watt
