hadoop-common-dev mailing list archives

From "Devaraj Das" <d...@yahoo-inc.com>
Subject RE: Why does MapRunner collect all intermediate key-value in memory?
Date Thu, 15 Mar 2007 04:34:49 GMT
> I looked into the implementation and noticed that all the intermediate key
> value pairs are collected in memory for the entire duration of any single
> MapRunner instance. As I understand from reading the code, the MapRunner
> keeps calling the user-defined map() method for all the key-value pairs
> assigned to it by the MapTask. The MapTask does the check for whether it
> should be dumping the intermediate key value pairs to the disk only after
> the MapRunner.run() method has returned.

This is not true. A sort and spill to disk does happen at the end of the
map task, but intermediate sort-and-spill passes also run whenever the
in-memory buffer has consumed enough memory. The buffer size is capped at
the io.sort.mb config value.
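
To make the behaviour concrete, here is a minimal toy sketch of a collector
that spills whenever its buffer reaches a byte cap. It is not the actual
MapTask code: the class SpillingCollector, its methods, and the rough size
estimate are all hypothetical, and the cap only plays the role that
io.sort.mb plays in the real task.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: NOT Hadoop's MapTask. All names here are hypothetical. */
class SpillingCollector {
    private final long capBytes;                 // stands in for io.sort.mb (in bytes here)
    private final List<String[]> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private int spillCount = 0;
    private long spilledRecords = 0;

    SpillingCollector(long capBytes) {
        this.capBytes = capBytes;
    }

    /** Called once per emitted pair, i.e. while map() is still running. */
    void collect(String key, String value) {
        buffer.add(new String[] {key, value});
        // Rough estimate (2 bytes per char); a real collector would count serialized bytes.
        bufferedBytes += 2L * (key.length() + value.length());
        if (bufferedBytes >= capBytes) {
            sortAndSpill();                      // intermediate spill, before the task ends
        }
    }

    /** Called at the end of the task to flush whatever is still buffered. */
    void close() {
        if (!buffer.isEmpty()) {
            sortAndSpill();
        }
    }

    private void sortAndSpill() {
        buffer.sort((a, b) -> a[0].compareTo(b[0]));
        // A real implementation would write the sorted run to local disk here.
        spilledRecords += buffer.size();
        spillCount++;
        buffer.clear();
        bufferedBytes = 0;
    }

    int getSpillCount() { return spillCount; }
    long getSpilledRecords() { return spilledRecords; }

    public static void main(String[] args) {
        SpillingCollector c = new SpillingCollector(40);
        for (int i = 0; i < 10; i++) {
            c.collect("k" + i, "v");
        }
        c.close();
        System.out.println("spills=" + c.getSpillCount()
                + " records=" + c.getSpilledRecords());
    }
}
```

The point of the sketch is that the cap is checked inside collect(), so the
buffer never grows much past it regardless of how many pairs a single map
task emits; close() only handles the remainder.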

> Now, I was facing problems because, due to the nature of this application,
> I ended up emitting too many intermediate key-value pairs for some set of
> the input data allocated to a single MapRunner instance. This was
> leading to the JVM going OutOfMemory.

While I agree that the memory model can be refined further (and I will
submit a patch for HADOOP-875), you should not see these exceptions under
normal circumstances. I would recommend that you increase the heap size
that a child JVM uses by tweaking the value of mapred.child.java.opts (for
example, you can try setting it to -Xmx512m or higher).

> If my understanding of the implementation is correct, then I am wondering
> if there is any particular reason to take this approach. A better approach
> (and I may be wrong here) would be to let MapRunner keep track of the
> memory it has been utilizing and, if the allocations run too high, it
> should:
> 
> 1) Either dump the intermediate key-value pairs to disk itself, OR
> 2) (the better option) call a new API provided by the MapTask that would
> dump the key-value pairs to disk and then pass control back to the
> MapRunner. MapRunner would simply resume the task and ultimately return
> in the normal way.

This is already in place: the MapTask keeps track of the memory used by the
buffer and spills to disk as described above.

> -----Original Message-----
> From: Gaurav Agarwal [mailto:gauravagarwal_4@yahoo.com]
> Sent: Thursday, March 15, 2007 4:07 AM
> To: hadoop-dev@lucene.apache.org
> Subject: Why does MapRunner collect all intermediate key-value in memory?
> 
> 
> Hi all,
> 
> I have started using Hadoop for a few of my Natural Language Processing
> applications. I was facing a problem due to my programs throwing an
> OutOfMemory exception during the Map phase.
> 
> I looked into the implementation and noticed that all the intermediate key
> value pairs are collected in memory for the entire duration of any single
> MapRunner instance. As I understand from reading the code, the MapRunner
> keeps calling the user-defined map() method for all the key-value pairs
> assigned to it by the MapTask. The MapTask does the check for whether it
> should be dumping the intermediate key value pairs to the disk only after
> the MapRunner.run() method has returned.
> 
> Now, I was facing problems because, due to the nature of this application,
> I ended up emitting too many intermediate key-value pairs for some set of
> the input data allocated to a single MapRunner instance. This was
> leading to the JVM going OutOfMemory.
> 
> If my understanding of the implementation is correct, then I am wondering
> if there is any particular reason to take this approach. A better approach
> (and I may be wrong here) would be to let MapRunner keep track of the
> memory it has been utilizing and, if the allocations run too high, it
> should:
> 
> 1) Either dump the intermediate key-value pairs to disk itself, OR
> 2) (the better option) call a new API provided by the MapTask that would
> dump the key-value pairs to disk and then pass control back to the
> MapRunner. MapRunner would simply resume the task and ultimately return
> in the normal way.
> 
> I am suggesting this approach as there are other applications too which
> may benefit if they are not restricted by this limitation.
> 
> Please let me know your opinions on this. If this is not incorporated
> into the main Hadoop release, then I intend to add it as a patch for my
> applications. Do you see any obvious loopholes which I might have
> overlooked?
> 
> Thanks in advance for the help!
> 
> Regards
> Gaurav
> --
> View this message in context:
> http://www.nabble.com/Why-does-MapRunner-collect-all-intermediate-key-value-in-memory--tf3405027.html#a9484185
> Sent from the Hadoop Dev mailing list archive at Nabble.com.


