hadoop-hdfs-issues mailing list archives

From "Tao Jie (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-11638) Support marking a datanode dead by DFSAdmin
Date Tue, 11 Apr 2017 01:26:41 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963715#comment-15963715 ]

Tao Jie commented on HDFS-11638:

[~shahrs87], yes. In this case, when the datanode started to panic (possibly a kernel bug or a
hardware failure), the node lost its connection to Ambari and we could not log in to it.
Everything returned to normal once we restarted the bad node. We are trying to handle this
case automatically, so we want the monitoring system (Ambari) to mark the bad datanode as dead
when this happens again.

> Support marking a datanode dead by DFSAdmin
> -------------------------------------------
>                 Key: HDFS-11638
>                 URL: https://issues.apache.org/jira/browse/HDFS-11638
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Tao Jie
> We ran into the following situation:
> A kernel error occurred on one slave node, with error messages like:
> {code}
> Apr 1 08:48:05 xxhdn033 kernel: BUG: soft lockup - CPU#0 stuck for 67s! [java:19096]
> Apr 1 08:48:05 xxhdn033 kernel: Modules linked in: bridge stp llc fuse autofs4 bonding ipv6 uinput iTCO_wdt iTCO_vendor_support microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_core joydev i2c_i801 i2c_core lpc_ich mfd_core sg ses enclosure ixgbe dca ptp pps_core mdio ext4 jbd2 mbcache sd_mod crc_t10dif ahci megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
> {code}
> The datanode process was still alive and continued to send heartbeats to the namenode,
> but the node could not respond to any command sent to it, and reading or writing blocks on
> this datanode would fail. As a result, requests to HDFS became slower because of the many
> read/write timeouts.
> We are trying to work around this case by adding a dfsadmin command that forcibly marks such
> an abnormal datanode as dead until it is restarted. When this case happens again, that would
> keep clients from accessing the faulty datanode.
> Any thoughts?
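
To make the proposal above concrete, here is a minimal Java sketch of the idea: a liveness check in which an administrative "forced dead" flag overrides fresh heartbeats until the node restarts. It is not HDFS code. The class ForcedDeadSketch, its methods, and the string node IDs are all hypothetical names chosen for illustration, and no dfsadmin option is implied.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration only; none of these names exist in HDFS. */
class ForcedDeadSketch {
    // How long a heartbeat stays valid (assumed parameter).
    private final long heartbeatExpiryMillis;
    // Last heartbeat time per node ID.
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    // Nodes an administrator has forcibly marked dead.
    private final Set<String> forcedDead = ConcurrentHashMap.newKeySet();

    ForcedDeadSketch(long heartbeatExpiryMillis) {
        this.heartbeatExpiryMillis = heartbeatExpiryMillis;
    }

    /** A heartbeat refreshes the timestamp but does NOT clear the forced-dead flag. */
    void recordHeartbeat(String nodeId, long nowMillis) {
        lastHeartbeat.put(nodeId, nowMillis);
    }

    /** What the proposed admin command would do in this sketch. */
    void markDead(String nodeId) {
        forcedDead.add(nodeId);
    }

    /** A restart clears the flag and counts as a fresh heartbeat. */
    void onRestart(String nodeId, long nowMillis) {
        forcedDead.remove(nodeId);
        lastHeartbeat.put(nodeId, nowMillis);
    }

    /** Alive means: not forced dead, and the last heartbeat has not expired. */
    boolean isAlive(String nodeId, long nowMillis) {
        if (forcedDead.contains(nodeId)) {
            return false;
        }
        Long last = lastHeartbeat.get(nodeId);
        return last != null && nowMillis - last <= heartbeatExpiryMillis;
    }
}
```

In this sketch a node that keeps heartbeating after markDead is still reported as not alive, which is the behavior described for the hung datanode; only onRestart makes it eligible again.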
