hadoop-common-user mailing list archives

From Jacob R Rideout <apa...@jacobrideout.net>
Subject Shuffle In Memory OutOfMemoryError
Date Sat, 06 Mar 2010 16:31:28 GMT
Hi all,

We are seeing the following error in the reducers of a particular job:

Error: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


After enough reducers fail, the entire job fails. This error occurs
regardless of whether mapred.compress.map.output is true. We were able
to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
to 20%. Shouldn't the framework, via ShuffleRamManager.canFitInMemory
and ShuffleRamManager.reserve, correctly detect the memory
available for allocation? I would think that with poor configuration
settings (and default settings in particular) the job might be less
efficient, but would not die.
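
To make the question concrete, here is a rough sketch of how a budget
check of this kind could behave. It assumes the shuffle buffer is sized
as a fixed fraction of the JVM's maximum heap, and that a single map
output is admitted to memory only if it is smaller than a fixed fraction
of that buffer. The class ShuffleBudget, its method names, the 0.25
single-segment fraction, and the 200 MB heap are illustrative
assumptions, not the actual ReduceTask code.

```java
// Simplified, hypothetical model of a shuffle memory budget.
// It is NOT the Hadoop implementation; names and constants are assumptions.
final class ShuffleBudget {
    // Assumed share of the shuffle buffer that one map output may occupy.
    static final double MAX_SINGLE_SEGMENT_FRACTION = 0.25;

    // Buffer size derived from the maximum heap and the configured
    // fraction (the role played by mapred.job.shuffle.input.buffer.percent).
    static long bufferBytes(long heapBytes, double inputBufferPercent) {
        return (long) (heapBytes * inputBufferPercent);
    }

    // Largest single map output this model would accept into memory.
    static long maxSingleSegmentBytes(long heapBytes, double inputBufferPercent) {
        return (long) (bufferBytes(heapBytes, inputBufferPercent)
                * MAX_SINGLE_SEGMENT_FRACTION);
    }

    // Admission check: compares the segment against the configured budget
    // only. It never looks at how much heap is actually free right now.
    static boolean canFitInMemory(long segmentBytes, long heapBytes,
                                  double inputBufferPercent) {
        return segmentBytes < maxSingleSegmentBytes(heapBytes, inputBufferPercent);
    }

    public static void main(String[] args) {
        long heap = 200L * 1024 * 1024; // assumed 200 MB reducer heap
        long segment = 4191933L;        // segment size taken from the reducer log
        for (double pct : new double[] {0.70, 0.20}) {
            System.out.println("percent=" + pct
                    + " buffer=" + bufferBytes(heap, pct)
                    + " maxSingleSegment=" + maxSingleSegmentBytes(heap, pct)
                    + " fits=" + canFitInMemory(segment, heap, pct));
        }
    }
}
```

Under these assumptions the check depends only on the configured budget,
so a 4191933-byte map output like the one in our reducer log would be
admitted at either setting. If the rest of the task is using more heap
than the budget leaves free, the later allocation could still fail,
which would be consistent with the error we see.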

Here is some more context from the logs. I have attached the full
reducer log here: http://gist.github.com/323746


2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
Shuffling 4191933 bytes (435311 raw bytes) into RAM from
attempt_201003060739_0002_m_000061_0
2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
attempt_201003060739_0002_m_000202_0
2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_201003060739_0002_r_000000_0 adding host
hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
failures
2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
attempt_201003060739_0002_r_000000_0 : Map output copy failure :
java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


We tried this in both 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
would address the issue in 0.20.2, but it did not. Does anyone have
any comments or suggestions? Is this a bug I should file a JIRA for?

Jacob Rideout
Return Path
