[ https://issues.apache.org/jira/browse/HDFS-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455262#comment-13455262
]
nkeywal commented on HDFS-3912:
-------------------------------
Some thinking, with an HBase bias:
- if the datanode is too busy and cannot heartbeat within a minute, we will also get timeouts
when writing the blocks (if the datanode is dead: 20s connect timeout. If it's not dead, or
if we previously had a connection, we will fail on the read timeout for the ack, which is
around 1 minute by default).
- the recovery is on the critical path, so going to a suspicious node is not something you
want to do.
- things are already quite complicated, so I think I would end up with the same value for
read & write to keep them simple.
Then there is the case when many nodes are stale. I think we're in really bad shape at
this stage... I feel that just throwing an exception is the best solution. HBase would wait
a few seconds and retry. That's better for the cluster than trying a node that is unlikely
to execute the write. But it is a change from today's behavior.
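To make the "wait a few seconds and retry" idea concrete, here is a minimal client-side
sketch. It is not HBase's actual retry code: RetryingWriter, callWithRetry, maxAttempts and
pauseMillis are hypothetical names, and it assumes the failed write surfaces as an
IOException.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper, for illustration only (not an HBase or HDFS class).
final class RetryingWriter {
    // Runs op, retrying after a fixed pause when it fails with an IOException.
    // Rethrows the last IOException once maxAttempts is exhausted.
    static <T> T callWithRetry(Callable<T> op, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e;
                // Give the cluster time to recover before the next attempt.
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }
}
```

A caller would wrap the write in the Callable and pick a pause of a few seconds, so a
cluster with many stale nodes is not hammered while it recovers.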
To summarize, this could make sense imho:
- there are enough fully alive nodes: let's use them, whatever the number of stale nodes.
- there are not enough fully alive nodes, but there are some stale nodes that we could use:
let's use them; at least the behavior will be backward compatible.
- there are not enough live nodes: behave as today.
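The three cases above can be sketched as follows. This is only an illustration, not the
HDFS block placement code: Node, StaleAwareChooser and choose() are hypothetical names,
"live" is assumed to hold every node that is not dead (stale or not), and "as today" is
assumed to mean returning whatever live nodes exist and letting the caller handle the
shortfall.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed policy (not an HDFS class).
final class StaleAwareChooser {
    // Minimal stand-in for a datanode: a name and a stale flag.
    static final class Node {
        final String name;
        final boolean stale;

        Node(String name, boolean stale) {
            this.name = name;
            this.stale = stale;
        }
    }

    // Picks up to `needed` targets, preferring nodes that are not stale.
    static List<Node> choose(List<Node> live, int needed) {
        List<Node> fresh = new ArrayList<>();
        List<Node> stale = new ArrayList<>();
        for (Node n : live) {
            if (n.stale) {
                stale.add(n);
            } else {
                fresh.add(n);
            }
        }

        // Case 1: enough fully alive nodes, so ignore the stale ones entirely.
        if (fresh.size() >= needed) {
            return new ArrayList<>(fresh.subList(0, needed));
        }

        // Case 2: not enough fully alive nodes, so top up with stale nodes.
        // Case 3: still short after that, so return every live node (assumed
        // to match today's behavior) and let the caller decide what to do.
        List<Node> chosen = new ArrayList<>(fresh);
        int missing = Math.min(needed - fresh.size(), stale.size());
        chosen.addAll(stale.subList(0, missing));
        return chosen;
    }
}
```

With two fully alive and two stale nodes, asking for 2 targets returns only the fully alive
ones, asking for 3 adds one stale node, and asking for 5 returns all 4.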
> Detecting and avoiding stale datanodes for writing
> --------------------------------------------------
>
> Key: HDFS-3912
> URL: https://issues.apache.org/jira/browse/HDFS-3912
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Jing Zhao
> Assignee: Jing Zhao
>
> 1. Make stale timeout adaptive to the number of nodes marked stale in the cluster.
> 2. Consider having a separate configuration for write skipping the stale nodes.