hadoop-mapreduce-issues mailing list archives

From "Siddharth Seth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4842) Shuffle race can hang reducer
Date Thu, 20 Dec 2012 09:09:13 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536875#comment-13536875 ]

Siddharth Seth commented on MAPREDUCE-4842:

bq. You are right that such a scenario is possible. However, the fetcher thread will be waiting
in waitForInMemoryMerge() or it may get stalled map output. This may mitigate the problem.
I have an idea on how to eliminate this problem completely. I will verify that it will work
and post it as part of the patch later. It will be simple, I promise
Fetches can already be in progress. I did see multiple single-file merges with the patch applied:
the terasort example that I ran ended up with 6 on-disk files to merge, instead of the 3 in
the current implementation. I'm not sure why the Fetcher is waiting for the InMemoryMerge
to complete. In any case, your latest patch likely takes care of this.

bq. Can you describe a scenario when this might be a problem? We can address that too.
ReduceTask logs should have the exception. I didn't look in detail, but I believe it's caused
by a notify after all merges are complete, followed by an attempt in the finally block to
remove an element.

Asokan, for this specific JIRA, I'd at least be more comfortable with Arun/Jason's patch
to fix this blocker, with a follow-up JIRA to clean up the code with the patch you posted;
this assumes, of course, that there isn't a degradation in performance. The original patch
isn't doing much other than checking whether a merge can run after the existing merge
completes. It's a bigger patch, but simpler in terms of functional changes.
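
To make the "check whether a merge can run after the existing merge completes" idea concrete, here is a minimal single-threaded sketch. It is not the code from either patch and not Hadoop's MergeManager: the class `MergeSketch`, its fields, and the byte thresholds are all hypothetical, and the model only assumes that fetchers are told to WAIT while reserved memory is over a limit and that a merge is triggered once enough output has been committed.

```java
/**
 * Hypothetical, simplified model of in-memory shuffle accounting.
 * All names and thresholds are illustrative assumptions, not Hadoop APIs.
 */
class MergeSketch {
    private final long memoryLimit;     // above this, fetchers are told to WAIT
    private final long mergeThreshold;  // committed bytes that should trigger a merge
    private long usedBytes;             // reserved memory, including data being merged
    private long commitBytes;           // committed output not yet handed to a merge
    private long mergingBytes;          // bytes owned by the merge currently running
    private boolean mergeInProgress;
    private int mergesStarted;

    MergeSketch(long memoryLimit, long mergeThreshold) {
        this.memoryLimit = memoryLimit;
        this.mergeThreshold = mergeThreshold;
    }

    /** A fetcher asks for memory; false stands in for the WAIT answer. */
    synchronized boolean reserve(long bytes) {
        if (usedBytes > memoryLimit) {
            return false;
        }
        usedBytes += bytes;
        return true;
    }

    /** A fetcher finished copying a map output into memory. */
    synchronized void commit(long bytes) {
        commitBytes += bytes;
        startMergeIfNeeded();
    }

    /** Called when the running merge completes and its memory is released. */
    synchronized void mergeFinished() {
        usedBytes -= mergingBytes;
        mergingBytes = 0;
        mergeInProgress = false;
        // The step under discussion: output committed while the merge was
        // running may already exceed the threshold, so check again here.
        // Without this call, nothing would start the next merge and the
        // fetchers could be left waiting indefinitely.
        startMergeIfNeeded();
        notifyAll(); // would wake fetcher threads blocked in wait(); none exist in this sketch
    }

    private void startMergeIfNeeded() {
        if (!mergeInProgress && commitBytes >= mergeThreshold) {
            mergeInProgress = true;
            mergingBytes = commitBytes;
            commitBytes = 0;
            mergesStarted++;
        }
    }

    synchronized int mergesStarted() { return mergesStarted; }
    synchronized boolean isMergeInProgress() { return mergeInProgress; }
}
```

In this model, if 110 bytes are committed while the first merge is running (limit 100, threshold 60), memory is still over the limit when that merge finishes, so only the re-check in `mergeFinished()` gets a second merge started.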

> Shuffle race can hang reducer
> -----------------------------
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: mapreduce-4842.patch, mapreduce-4842.patch, MAPREDUCE-4842.patch,
>                      MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  It looked
> similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told
> to WAIT by the MergeManager but no merge was taking place.

