hadoop-hdfs-issues mailing list archives

From "Tsz Wo (Nicholas), SZE (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3504) Configurable retry in DFSClient
Date Mon, 11 Jun 2012 18:42:44 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292963#comment-13292963 ]

Tsz Wo (Nicholas), SZE commented on HDFS-3504:
----------------------------------------------

> Not sure if exponential backoff is flexible enough. Typically one wants to retry every 10 sec till about a minute and then retry every 60 sec.

With the exponentialBackoff retry policy, and assuming the average sleep time of the
first retry is 1 second, we have the following.

|| n-th retry || average sleep time (seconds) || on average, the n-th retry will happen in (seconds) ||
| 1 | 1 | 1 |
| 2 | 2 | 3 |
| 3 | 4 | 7 |
| 4 | 8 | 15 |
| 5 | 16 | 31 |
| ... | ... | ... |
| n | 2^(n-1) | 2^n - 1 |
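
The table can be reproduced with a few lines of Java. This is only a sketch of the arithmetic, assuming the average sleep doubles on each retry starting from 1 second; the class and method names are made up for illustration and are not part of the patch.

```java
// Hypothetical helper, not part of the HDFS-3504 patch: it only reproduces
// the arithmetic of the table above.
final class BackoffSchedule {

    /** Average sleep before the n-th retry (1-based), in seconds: 2^(n-1). */
    static long averageSleepSeconds(int n) {
        if (n < 1 || n > 62) {
            throw new IllegalArgumentException("n must be in [1, 62]");
        }
        return 1L << (n - 1);
    }

    /** Average time at which the n-th retry happens, in seconds: sum of sleeps 1..n. */
    static long elapsedSeconds(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) {
            total += averageSleepSeconds(i);
        }
        return total; // equals 2^n - 1
    }

    public static void main(String[] args) {
        // Print one row per retry: n, average sleep, average elapsed time.
        for (int n = 1; n <= 10; n++) {
            System.out.printf("%2d %4d %5d%n",
                n, averageSleepSeconds(n), elapsedSeconds(n));
        }
    }
}
```

For n = 10 the last row gives an average sleep of 512 seconds and an elapsed time of 1023 seconds.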

The value of dfs.client.retry.max should depend on the failover time.  Suppose the failover
time is around 10 minutes.  Then setting dfs.client.retry.max=10 will take ~17 minutes
(2^10 - 1 = 1023 seconds) to finish all 10 retries.  However, the last few retries will sleep
for a long time: the 10th retry alone sleeps 512 seconds on average.  I think that is
undesirable.  Let me think about this more.
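
For comparison, the schedule suggested in the quoted comment (retry every 10 sec till about a minute, then every 60 sec) can be sketched as below. This is an illustration only: reading "about a minute" as six retries of 10 seconds is an assumption, and the class and method names are hypothetical, not anything in DFSClient.

```java
// Hypothetical sketch of a two-stage fixed-interval schedule; the stage
// lengths are assumptions taken from the reviewer's informal description.
final class StagedRetrySketch {

    /** Sleep before the n-th retry (1-based): 10 s for retries 1..6, then 60 s. */
    static long sleepSeconds(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        return n <= 6 ? 10 : 60;
    }

    /** Time at which the n-th retry happens, in seconds. */
    static long elapsedSeconds(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) {
            total += sleepSeconds(i);
        }
        return total;
    }

    /** Smallest number of retries whose cumulative sleep reaches targetSeconds. */
    static int retriesToCover(long targetSeconds) {
        long elapsed = 0;
        int n = 0;
        while (elapsed < targetSeconds) {
            n++;
            elapsed += sleepSeconds(n);
        }
        return n;
    }

    public static void main(String[] args) {
        // Retries needed to span a 10-minute (600 s) failover.
        System.out.println(retriesToCover(600));
    }
}
```

Under these assumptions a 10-minute failover is covered by 15 retries and no single sleep exceeds 60 seconds, whereas exponential backoff needs only 10 retries but sleeps 512 seconds before the last one.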

> You forgot about the connection retry.

Sure, will also change it.

> Why is MiniDfsCluster changes needed?

I just moved the LOG message "Cluster is active" to waitActive().  I believe it is a
better place for it.

                
> Configurable retry in DFSClient
> -------------------------------
>
>                 Key: HDFS-3504
>                 URL: https://issues.apache.org/jira/browse/HDFS-3504
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 1.0.0, 2.0.0-alpha
>            Reporter: Siddharth Seth
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: h3504_20120607.patch, h3504_20120608.patch
>
>
> When NN maintenance is performed on a large cluster, jobs end up failing. This is particularly bad for long running jobs. The client retry policy could be made configurable so that jobs don't need to be restarted.

