hadoop-mapreduce-user mailing list archives

From GUOJUN Zhu <guojun_...@freddiemac.com>
Subject RE: Map Reduce Theory Question, getting OutOfMemoryError while reducing
Date Fri, 29 Jun 2012 17:32:09 GMT
If you are referring to the Iterable in the reducer, it is special and the 
values are not held in memory at all.  Once the iterator passes a value, it 
is lost and you cannot recover it.  There is no LinkedList behind it.
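
For example (a sketch only; the reducer and record types here are 
illustrative, not from your job), the framework streams the values from 
sorted on-disk data and reuses a single Writable object behind the 
iterator, so you cannot treat it like a list:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: why the values Iterable is not a list.
public class IterableDemoReducer
    extends Reducer<Text, LongWritable, Text, LongWritable> {

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values,
      Context context) throws IOException, InterruptedException {
    List<LongWritable> saved = new ArrayList<LongWritable>();
    for (LongWritable v : values) {
      // The framework deserializes every value into the SAME object,
      // so this list ends up with N references to one instance.
      saved.add(v);
      // Keeping a value for real requires a copy, e.g.
      // saved.add(new LongWritable(v.get())), and that copying is
      // exactly what fills the heap.
    }
    // The iterator cannot be rewound: a second pass over 'values'
    // here would not replay the data.  It is streamed from sorted
    // on-disk segments, not held in memory.
  }
}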

Zhu, Guojun
Modeling Sr Graduate
571-3824370
guojun_zhu@freddiemac.com
Financial Engineering
Freddie Mac



   "Berry, Matt" <mwberry@amazon.com> 
   06/29/2012 01:06 PM
   Please respond to
mapreduce-user@hadoop.apache.org


To
"mapreduce-user@hadoop.apache.org" <mapreduce-user@hadoop.apache.org>
cc

Subject
RE: Map Reduce Theory Question, getting OutOfMemoryError while reducing






I was actually quite curious as to how Hadoop was managing to get all of 
the records into the Iterable in the first place. I thought they were 
using a very specialized object that implements Iterable, but a heap dump 
shows they're likely just using a LinkedList. All I was doing was 
duplicating that object. Supposing I do as you suggest, am I in danger of 
having their list consume all the memory if a user decides to log 2x or 3x 
as much as they did this time?

~Matt

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Friday, June 29, 2012 6:52 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Map Reduce Theory Question, getting OutOfMemoryError while reducing

Hey Matt,

As far as I can tell, Hadoop isn't truly at fault here.

If your issue is that you collect the values into a list before you store 
them, you should focus on that and avoid collecting them altogether. Why 
not serialize each value as you receive it, if the incoming order is 
already taken care of? As far as I can tell, your AggregateRecords 
probably does nothing but serialize the stored LinkedList. So instead of 
using a LinkedList, or even a composed Writable such as AggregateRecords, 
just write the records out as you receive them via each .next(). Would 
that not work for you? You may batch a constant number of records to gain 
some write performance, but at least you won't use up your memory.

You can serialize as you receive by following this:
http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F
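
Roughly like this (a sketch only; the record type and side-file naming 
are placeholder choices built on the FAQ's side-file approach):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a reducer that serializes each value as it arrives,
// instead of collecting everything into a list first.
public class StreamingReducer
    extends Reducer<Text, Text, NullWritable, NullWritable> {

  private FSDataOutputStream out;

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    // Side file under the task's work directory, as the FAQ suggests;
    // the attempt ID keeps concurrent (speculative) attempts apart.
    Path side = new Path(FileOutputFormat.getWorkOutputPath(context),
        "records-" + context.getTaskAttemptID());
    out = fs.create(side);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text v : values) {
      // Write each record the moment .next() hands it to us;
      // nothing accumulates on the heap.
      v.write(out);
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    out.close();
  }
}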



--
Harsh J

