hadoop-mapreduce-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts
Date Sat, 30 Aug 2014 05:36:53 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116244#comment-14116244 ]

Junping Du commented on MAPREDUCE-5891:

Thanks [~jlowe] for the comments!
bq. SHUFFLE_FETCH_TIMEOUT_MS should be "mapreduce.reduce.shuffle.fetch.retry.timeout-ms"
Nice catch, done.

bq. openConnectionWithRetry calls abortConnect if stopped, but the one caller of this function
does the same thing when it returns. Maybe openConnectionWithRetry should just return if stopped?
Yes. The caller can even return directly, since its own caller in the upper layer already handles this. Fixed.

bq. Nit: The code block in copyMapOutput's catch of IOException is getting really long. It
would be good to refactor some of this code into methods. Minor nit: "get failed" should be

bq. openConnectionWithRetry is being called and retries even if fetch retry is disabled
Good point, fixed.

bq. Shouldn't we be setting retryStartTime back to zero instead of endTime below?
Also a good one, fixed.

bq. Also wondering if we should reset it after each successful transfer (e.g.: after a successful
header parse and successful shuffle)?
That may not be necessary. If retryStartTime is not 0, this fetcher has not successfully
made any progress since the last failure of getMapOutput, so it should keep retrying and
accumulating wait time until the timeout.
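
As a rough illustration of the retry window described above, here is a minimal sketch. It is not code from the patch: the class FetchRetryWindow, its methods, and the injected clock are hypothetical names chosen for the example. It assumes the clock never returns 0, because 0 is used as the "no retry in progress" marker, and it leaves the decision of when to call reset() to the caller.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper; the names are illustrative and not taken from the patch. */
class FetchRetryWindow {
    private final long timeoutMs;
    private final LongSupplier clock;   // assumed to return a positive time in ms
    private long retryStartTime = 0;    // 0 means no retry window is open

    FetchRetryWindow(long timeoutMs, LongSupplier clock) {
        this.timeoutMs = timeoutMs;
        this.clock = clock;
    }

    /** Call on each fetch failure; returns true while retrying is still allowed. */
    boolean shouldRetry() {
        long now = clock.getAsLong();
        if (retryStartTime == 0) {
            // First failure since the last reset: open the window.
            retryStartTime = now;
        }
        // Wait time accumulates from the first failure, not from the latest one.
        return now - retryStartTime <= timeoutMs;
    }

    /** Close the window by setting the start time back to zero. */
    void reset() {
        retryStartTime = 0;
    }

    long getRetryStartTime() {
        return retryStartTime;
    }
}
```

In this sketch, repeated failures keep measuring from the first failure, so the total wait is bounded by the timeout no matter how many individual retries occur.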

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, MAPREDUCE-5891.patch
> To minimize the number of map fetch failures reported by reducers across an NM restart
it would be nice if reducers only reported a fetch failure after trying for a specified period
of time to retrieve the data.

This message was sent by Atlassian JIRA
