hadoop-hdfs-issues mailing list archives

From "Suresh Srinivas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3703) Decrease the datanode failure detection time
Date Mon, 10 Sep 2012 17:29:07 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452160#comment-13452160
] 

Suresh Srinivas commented on HDFS-3703:
---------------------------------------

Nicolas, let's open a separate jira for the DFSInputStream#readBlockLength issue you mentioned,
instead of addressing it in this jira.

Regarding the patch in this jira, here are my thoughts:
# For the read side, the patch is straightforward: we have the list of datanodes where the
block is, and we re-order it based on liveness.
# However, for the write side, not picking the stale node could cause problems, especially
on small clusters. That is why I think we should do the write-side changes in a related jira.
We should consider making the stale timeout adaptive to the number of nodes marked stale in
the cluster, as discussed in the previous comments. Additionally, we should consider having
a separate configuration for skipping stale nodes on writes.
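To make the read-side idea concrete, here is a rough sketch (class and field names are illustrative, not the actual patch) of re-ordering the located datanodes so that stale nodes sort last rather than being dropped, which keeps them as a fallback on small clusters:

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of the read-side reordering: stale replicas are
// not removed, only pushed to the end of the location list, so the
// client can still fall back to them.
public class StaleNodeSort {

    // Stand-in for DatanodeInfo; only the fields the comparator needs.
    static class Node {
        final String name;
        final long lastUpdateMillis;
        Node(String name, long lastUpdateMillis) {
            this.name = name;
            this.lastUpdateMillis = lastUpdateMillis;
        }
        // A node is stale when it has not heartbeated within the interval.
        boolean isStale(long nowMillis, long staleIntervalMillis) {
            return nowMillis - lastUpdateMillis > staleIntervalMillis;
        }
    }

    // Order live nodes before stale ones; the stable sort preserves the
    // original (e.g. network-topology) order within each group.
    static void reorder(Node[] nodes, long now, long staleInterval) {
        Arrays.sort(nodes, Comparator.comparing(
            (Node n) -> n.isStale(now, staleInterval))); // false (live) < true (stale)
    }

    public static void main(String[] args) {
        long now = 100_000L, staleInterval = 30_000L; // 30s, as suggested above
        Node[] nodes = {
            new Node("dn1", 40_000L), // stale: 60s since last heartbeat
            new Node("dn2", 90_000L), // live
            new Node("dn3", 95_000L), // live
        };
        reorder(nodes, now, staleInterval);
        // prints "dn2 dn3 dn1"
        System.out.println(nodes[0].name + " " + nodes[1].name + " " + nodes[2].name);
    }
}
```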

Some early comments on the patch:
# Typo: {{compartor}} should be {{comparator}}
# Add annotation @InterfaceAudience.Private to DecomStaleComparator class
# The default stale period could be a bit longer, say 30s. Again, I know this is arbitrary,
but I still prefer a longer timeout.
# Rename BlockPlacementPolicyDefault#skipStaleNodes to checkForStaleNodes. Currently the
variable name means the opposite of how it is used.
# Can you add a description of what stale means in the javadoc for DatanodeInfo#isStale()?
Add a pointer to the configuration that decides the stale period.
# DFS_DATANODE_STALE_STATE_ENABLE_KEY should be named DFS_NAMENODE_CHECK_STALE_DATANODE_KEY.
(DFS_NAMENODE prefix means it is used by the namenode). Change the value to {{dfs.namenode....}}
# DFS_DATANODE_STALE_STATE_INTERVAL_KEY should be named DFS_NAMENODE_STALE_DATANODE_INTERVAL_KEY.
Change the value to {{dfs.namenode...}}
# "node is staled" should read "node is stale". In the same debug message, it is a good idea
to print the time since the last update; this should help in debugging.
# Why reset to the default value if the configured value is smaller? We should just print a
warning and continue.
# Why add the public method DatanodeManager#setCheckStaleDatanodes()?
# Instead of making setHeartbeatsDisabledForTests public, you could provide access to that
method via {{DatanodeTestUtils}}
# Please add descriptions for the newly added properties in hdfs-default.xml and explain how
they are used.
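As a sketch of the isStale() javadoc and the warn-and-continue validation suggested above (key names follow the proposed renames and are illustrative only, not the actual patch):

```java
// Illustrative sketch only: key names follow the renames suggested in
// the review; the real patch may differ.
public class StaleConfig {
    // Namenode-side keys (DFS_NAMENODE prefix, dfs.namenode.* values).
    static final String DFS_NAMENODE_CHECK_STALE_DATANODE_KEY =
        "dfs.namenode.check.stale.datanode";
    static final String DFS_NAMENODE_STALE_DATANODE_INTERVAL_KEY =
        "dfs.namenode.stale.datanode.interval";
    // Assumed lower bound for illustration (the heartbeat interval).
    static final long MIN_STALE_INTERVAL_MS = 3_000L;

    /**
     * A datanode is stale when the time since its last heartbeat exceeds
     * the interval configured by DFS_NAMENODE_STALE_DATANODE_INTERVAL_KEY.
     */
    static boolean isStale(long lastUpdateMillis, long nowMillis,
                           long staleIntervalMillis) {
        return nowMillis - lastUpdateMillis > staleIntervalMillis;
    }

    // Warn and continue instead of silently resetting to the default,
    // per the review comment above.
    static long validateInterval(long configured) {
        if (configured < MIN_STALE_INTERVAL_MS) {
            System.err.println("WARN: stale interval " + configured
                + "ms is below the heartbeat interval; using it anyway");
        }
        return configured;
    }
}
```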

I have not reviewed the tests yet.


> Decrease the datanode failure detection time
> --------------------------------------------
>
>                 Key: HDFS-3703
>                 URL: https://issues.apache.org/jira/browse/HDFS-3703
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, name-node
>    Affects Versions: 1.0.3, 2.0.0-alpha
>            Reporter: nkeywal
>            Assignee: Suresh Srinivas
>         Attachments: HDFS-3703-branch2.patch, HDFS-3703.patch, HDFS-3703-trunk-with-write.patch
>
>
> By default, if a box dies, the datanode will be marked as dead by the namenode after
10:30 minutes. In the meantime, this datanode will still be proposed by the namenode to write
blocks or to read replicas. It happens as well if the datanode crashes: there are no shutdown
hooks to tell the namenode we're not there anymore.
> It is especially an issue with HBase. The HBase regionserver timeout in production is often
30s. So with these configs, when a box dies HBase starts to recover after 30s while, for 10
minutes, the namenode will still consider the blocks on that box as available. Beyond the write
errors, this will trigger a lot of missed reads:
> - during the recovery, HBase needs to read the blocks used on the dead box (the ones
in the 'HBase Write-Ahead-Log')
> - after the recovery, reading these data blocks (the 'HBase region') will fail 33% of
the time with the default number of replicas, slowing data access, especially when the
errors are socket timeouts (i.e. around 60s most of the time).
> Globally, it would be ideal if the HDFS failure-detection time could be set under the HBase timeouts.
> As a side note, HBase relies on ZooKeeper to detect regionservers issues.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
