hadoop-mapreduce-issues mailing list archives

From "Siddharth Seth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4842) Shuffle race can hang reducer
Date Thu, 20 Dec 2012 09:09:13 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536875#comment-13536875
] 

Siddharth Seth commented on MAPREDUCE-4842:
-------------------------------------------

bq. You are right that such a scenario is possible. However, the fetcher thread will be waiting
in waitForInMemoryMerge() or it may get stalled map output. This may mitigate the problem.
I have an idea on how to eliminate this problem completely. I will verify that it will work
and post it as part of the patch later. It will be simple, I promise

Fetches can already be in progress. I did see multiple single-file merges with the patch applied;
the tera-sort example that I ran ended up with 6 on-disk files to merge instead of 3 in
the current implementation. I'm not sure why the Fetcher is waiting for the InMemoryMerge
to complete. In any case, your latest patch likely takes care of this.

bq. Can you describe a scenario when this might be a problem? We can address that too.

ReduceTask logs should have the exception. I didn't look in detail, but I believe it's caused
by a notify after all merges are complete - and there's an attempt to remove an element from
the finally block.

Asokan, for this specific JIRA, I'd at least be more comfortable with Arun/Jason's patch
to fix this blocker, with a follow-up JIRA to clean up the code with the patch you posted.
This assumes, of course, that there isn't a degradation in performance. The original patch
isn't doing much other than checking whether a merge can run after the existing merge
completes. It's a bigger patch, but simpler in terms of functional changes.
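
To make the "check whether a merge can run after the existing merge completes" idea concrete, here is a rough toy sketch. It is not the code from either patch and not Hadoop's MergeManager; the class ToyMergeManager, its methods, and the byte-count thresholds are all hypothetical and only illustrate the re-check step.

```java
/** Toy model only. NOT Hadoop's MergeManager; every name here is hypothetical. */
class ToyMergeManager {
    private final long memoryLimit;     // total bytes fetchers may hold in memory
    private final long mergeThreshold;  // committed bytes needed to trigger a merge
    private long usedMemory = 0;        // bytes reserved and not yet freed by a merge
    private long committedMemory = 0;   // fully fetched bytes not yet handed to a merge
    private long mergingBytes = 0;      // bytes owned by the merge currently running
    private boolean mergeInProgress = false;
    private int mergesStarted = 0;

    ToyMergeManager(long memoryLimit, long mergeThreshold) {
        this.memoryLimit = memoryLimit;
        this.mergeThreshold = mergeThreshold;
    }

    /** A fetcher asks for space; false means the fetcher is told to WAIT. */
    synchronized boolean reserve(long bytes) {
        if (usedMemory + bytes > memoryLimit) {
            return false;
        }
        usedMemory += bytes;
        return true;
    }

    /** A fetcher finished copying a map output into memory. */
    synchronized void commit(long bytes) {
        committedMemory += bytes;
        maybeStartMerge();
    }

    /** The running merge completed and released its memory. */
    synchronized void mergeFinished() {
        usedMemory -= mergingBytes;
        mergingBytes = 0;
        mergeInProgress = false;
        // Re-check here: outputs committed while the merge was running
        // would otherwise sit idle until some later commit happens.
        maybeStartMerge();
        notifyAll(); // wake any fetchers blocked waiting for memory
    }

    private void maybeStartMerge() {
        if (!mergeInProgress && committedMemory >= mergeThreshold) {
            mergeInProgress = true;
            mergingBytes = committedMemory;
            committedMemory = 0;
            mergesStarted++;
        }
    }

    synchronized int mergesStarted() { return mergesStarted; }
    synchronized boolean isMergeInProgress() { return mergeInProgress; }
}
```

In this toy model, with a limit of 130 and a threshold of 60, committing 70 bytes starts the first merge. A further 60 bytes committed while that merge runs cannot start a second one, so the second merge begins only because mergeFinished() re-checks the threshold.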




                
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: mapreduce-4842.patch, mapreduce-4842.patch, MAPREDUCE-4842.patch,
> MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang. It looked
> similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told
> to WAIT by the MergeManager but no merge was taking place.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
