hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6351) Reducer hung in copy phase.
Date Mon, 04 May 2015 13:45:06 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526636#comment-14526636 ]

Jason Lowe commented on MAPREDUCE-6351:
---------------------------------------

I suspect this is a duplicate of MAPREDUCE-6334.  I see a lot of these types of messages in
the reducer log:
{noformat}
2015-05-01 19:59:37,632 WARN [fetcher#13] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Shuffle output from glgs1190.grid.uh1.inmobi.com:13562 failed, retry it.
{noformat}

I think it is leaking memory allocations from the shuffle errors, so the shuffle buffer runs
out of available memory (hence the fetchers are told to WAIT) while there isn't enough data in
the shuffle buffer to trigger a merge.  The fetches behind the leaked memory will never
complete, so they will never kick off the merge and unblock the other threads.
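
To make the suspected failure mode concrete, here is a minimal sketch of the accounting, assuming
a simplified model rather than the actual Hadoop code. The class ShuffleBufferModel, its method
names, and the numbers (a limit of 100 bytes, a merge threshold of 66) are all hypothetical and
chosen only for illustration. Fetchers reserve memory up front, a merge starts only once enough
fully fetched data has been committed, and a failed fetch is supposed to give its reservation back.

```java
// Simplified, hypothetical model of shuffle-buffer accounting.
// This is NOT the Hadoop implementation; names and numbers are assumptions.
class ShuffleBufferModel {
    private final long memoryLimit;     // total in-memory shuffle budget
    private final long mergeThreshold;  // committed bytes needed to start a merge
    private long usedMemory;            // bytes currently reserved by fetchers
    private long commitMemory;          // bytes of fully fetched map outputs

    ShuffleBufferModel(long memoryLimit, long mergeThreshold) {
        this.memoryLimit = memoryLimit;
        this.mergeThreshold = mergeThreshold;
    }

    /** Returns true if the reservation succeeded, false if the fetcher must WAIT. */
    synchronized boolean reserve(long size) {
        if (usedMemory + size > memoryLimit) {
            return false;
        }
        usedMemory += size;
        return true;
    }

    /** What a failed fetch should do: hand the reservation back. */
    synchronized void unreserve(long size) {
        usedMemory -= size;
    }

    /** Called when a fetch completes; returns true if a merge should start. */
    synchronized boolean commit(long size) {
        commitMemory += size;
        return commitMemory >= mergeThreshold;
    }

    /** Simulates eight failed fetches followed by two successful ones. */
    static boolean hangs(boolean releaseOnFailure) {
        ShuffleBufferModel buf = new ShuffleBufferModel(100, 66);
        for (int i = 0; i < 8; i++) {
            buf.reserve(10);            // fetch starts, then fails
            if (releaseOnFailure) {
                buf.unreserve(10);      // correct cleanup; skipped when leaking
            }
        }
        boolean mergeStarted = false;
        for (int i = 0; i < 2; i++) {
            buf.reserve(10);            // fetch starts and completes
            mergeStarted |= buf.commit(10);
        }
        boolean nextFetcherWaits = !buf.reserve(10);
        // Hung: fetchers are blocked on memory and no merge will free any.
        return nextFetcherWaits && !mergeStarted;
    }

    public static void main(String[] args) {
        System.out.println("leak on failure:    hangs = " + hangs(true));
        System.out.println("release on failure: hangs = " + hangs(false));
    }
}
```

In this model the leaking run ends with 100 bytes reserved but only 20 committed, so the next
fetcher has to wait and the merge threshold is never reached. The run that releases memory on
failure keeps fetching.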

> Reducer hung in copy phase.
> ---------------------------
>
>                 Key: MAPREDUCE-6351
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6351
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.6.0
>            Reporter: Laxman
>         Attachments: jstat-gc.log, reducer-container-partial.log.zip, thread-dumps.out
>
>
> *Problem*
> Reducer gets stuck in the copy phase and doesn't make progress for a very long time. After
> killing this task manually a couple of times, it completes.
> *Observations*
> - Verified GC logs. Found no memory-related issues. Logs attached.
> - Verified thread dumps. Found no thread-related problems.
> - According to the logs, the fetcher threads are not copying the map outputs; they are
> just waiting for a merge to happen.
> - Merge thread is alive and in wait state.
> *Analysis* 
> After careful review of the logs, thread dumps, and code, this looks to me like a classic
> multi-threading issue: the thread goes into the wait state after it has already been notified.
> Here is the suspect code flow.
> *Thread #1*
> Fetcher thread - notification comes first
> org.apache.hadoop.mapreduce.task.reduce.MergeThread.startMerge(Set<T>)
> {code}
>       synchronized(pendingToBeMerged) {
>         pendingToBeMerged.addLast(toMergeInputs);
>         pendingToBeMerged.notifyAll();
>       }
> {code}
> *Thread #2*
> Merge Thread - goes to wait state (Notification goes unconsumed)
> org.apache.hadoop.mapreduce.task.reduce.MergeThread.run()
> {code}
>         synchronized (pendingToBeMerged) {
>           while(pendingToBeMerged.size() <= 0) {
>             pendingToBeMerged.wait();
>           }
>           // Pickup the inputs to merge.
>           inputs = pendingToBeMerged.removeFirst();
>         }
> {code}
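
For reference, the two quoted snippets can be exercised in isolation. The following is a minimal
sketch, assuming a hypothetical class PendingQueueDemo that copies only the synchronized blocks
above; the String payload and the takeNext() and notifyBeforeWait() helpers are illustrative and
not part of Hadoop. It issues the notification before the consuming thread has started, which is
the ordering described under Thread #1 and Thread #2.

```java
import java.util.LinkedList;

// Hypothetical stand-alone harness around the two quoted synchronized blocks.
class PendingQueueDemo {
    private final LinkedList<String> pendingToBeMerged = new LinkedList<>();

    // Same shape as the quoted startMerge(): enqueue and notify under the lock.
    void startMerge(String toMergeInputs) {
        synchronized (pendingToBeMerged) {
            pendingToBeMerged.addLast(toMergeInputs);
            pendingToBeMerged.notifyAll();
        }
    }

    // Same shape as the quoted run() body: wait in a loop that rechecks the queue.
    String takeNext() throws InterruptedException {
        synchronized (pendingToBeMerged) {
            while (pendingToBeMerged.size() <= 0) {
                pendingToBeMerged.wait();
            }
            return pendingToBeMerged.removeFirst();
        }
    }

    /** Notifies first, starts the consumer afterwards, and returns what it picked up. */
    static String notifyBeforeWait() {
        PendingQueueDemo queue = new PendingQueueDemo();
        queue.startMerge("batch-1");    // nobody is waiting yet
        String[] result = new String[1];
        Thread merger = new Thread(() -> {
            try {
                result[0] = queue.takeNext();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        merger.start();
        try {
            merger.join(5000);          // bounded wait so the demo cannot hang
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println("picked up: " + notifyBeforeWait());
    }
}
```

In this isolated sketch the consumer still picks up "batch-1", because the while loop checks the
queue size under the same lock before it ever calls wait(). That only covers the pattern as
quoted; it says nothing about other code paths in the reducer.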



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
