hadoop-common-dev mailing list archives

From "Albert Chern" <albert.ch...@gmail.com>
Subject Re: Why does MapRunner collect all intermediate key-value in memory?
Date Wed, 14 Mar 2007 23:11:20 GMT
I don't think the intermediate key/value pairs were ever entirely kept in
memory.  In the current trunk the spilling code is in
org.apache.hadoop.mapred.MapTask.MapOutputBuffer.  If I remember correctly
there is also a configuration parameter that you can use to adjust how many
pairs are kept in memory before the spill, but I can't remember what it's
called.
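
To make the idea concrete, here is a minimal Java sketch of a collector that spills while pairs are being emitted, instead of waiting for the map loop to return. This is not the actual MapOutputBuffer code: the class name SpillingBuffer, the maxPairsInMemory threshold, and the tab-separated temp-file format are all hypothetical and chosen only for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy collector: buffers pairs in memory and spills a sorted run to disk at a threshold. */
class SpillingBuffer {
    // Hypothetical knob for this sketch; it is not a real Hadoop configuration parameter.
    private final int maxPairsInMemory;
    private final List<String[]> buffer = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();

    SpillingBuffer(int maxPairsInMemory) {
        if (maxPairsInMemory < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.maxPairsInMemory = maxPairsInMemory;
    }

    /** Called once per emitted pair, i.e. from inside the map loop. */
    void collect(String key, String value) throws IOException {
        buffer.add(new String[] {key, value});
        if (buffer.size() >= maxPairsInMemory) {
            spill();
        }
    }

    /** Sorts the buffered pairs by key, writes them to a temp file, and empties the buffer. */
    void spill() throws IOException {
        if (buffer.isEmpty()) {
            return;
        }
        buffer.sort(Comparator.comparing((String[] pair) -> pair[0]));
        Path file = Files.createTempFile("spill-", ".txt");
        file.toFile().deleteOnExit();
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String[] pair : buffer) {
                out.write(pair[0] + "\t" + pair[1]);
                out.newLine();
            }
        }
        spillFiles.add(file);
        buffer.clear();
    }

    int inMemoryCount() { return buffer.size(); }

    int spillCount() { return spillFiles.size(); }

    List<Path> spillFiles() { return spillFiles; }
}
```

Because the size check happens inside collect(), memory use is bounded by the threshold no matter how many pairs a single map() call emits. A real implementation would also need to merge the spill files at the end of the task.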

Maybe your program is running out of memory for other reasons?

On 3/14/07, Gaurav Agarwal <gauravagarwal_4@yahoo.com> wrote:
>
>
> Hi all,
>
> I have started using Hadoop for a few of my Natural Language Processing
> applications. I have run into a problem: my programs throw OutOfMemory
> exceptions during the Map phase.
>
> I looked into the implementation and noticed that all the intermediate key
> value pairs are collected in memory for the entire duration of any single
> MapRunner instance. As I understand from reading the code, the MapRunner
> keeps calling the user-defined map() method for all the key-value pairs
> assigned to it by the MapTask. The MapTask does the check for whether it
> should be dumping the intermediate key value pairs to the disk only after
> the MapRunner.run() method has returned.
>
> I ran into problems because, due to the nature of this application, I
> ended up emitting too many intermediate key-value pairs for some of the
> input data allocated to a single MapRunner instance. This caused the JVM
> to run out of memory.
>
> If my understanding of the implementation is correct, then I am wondering
> if there is any particular reason to take this approach. A better approach
> (and I may be wrong here) would be to let MapRunner keep track of the
> memory it has been utilizing and if the allocations run too high then it
> should:
>
> 1) Either dump the intermediate key-value pairs to disk itself, OR
> 2) (the better option) call a new API provided by the MapTask that would
> dump the key-value pairs to disk and then pass control back to the
> MapRunner. The MapRunner would simply resume the task and ultimately
> return in the normal way.
>
> I am suggesting this approach as there are other applications too which
> may benefit if they are not restricted by this limitation.
>
> Please let me know your opinions on this. If this is not incorporated
> into the main Hadoop release, then I intend to add it as a patch for my
> applications. Do you see any obvious loopholes which I might have
> overlooked?
>
> Thanks in advance for the help!
>
> Regards
> Gaurav
> --
> View this message in context:
> http://www.nabble.com/Why-does-MapRunner-collect-all-intermediate-key-value-in-memory--tf3405027.html#a9484185
> Sent from the Hadoop Dev mailing list archive at Nabble.com.
>
>
