hadoop-hdfs-issues mailing list archives

From "Tao Jie (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-11638) Support marking a datanode dead by DFSAdmin
Date Mon, 10 Apr 2017 06:57:41 GMT
Tao Jie created HDFS-11638:
------------------------------

             Summary: Support marking a datanode dead by DFSAdmin
                 Key: HDFS-11638
                 URL: https://issues.apache.org/jira/browse/HDFS-11638
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Tao Jie


We ran into the following situation:
A kernel error occurred on one slave node, with error messages like
{code}
Apr 1 08:48:05 xxhdn033 kernel: BUG: soft lockup - CPU#0 stuck for 67s! [java:19096]
Apr 1 08:48:05 xxhdn033 kernel: Modules linked in: bridge stp llc fuse autofs4 bonding ipv6
uinput iTCO_wdt iTCO_vendor_support microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler
sb_edac edac_core joydev i2c_i801 i2c_core lpc_ich mfd_core sg ses enclosure ixgbe dca ptp
pps_core mdio ext4 jbd2 mbcache sd_mod crc_t10dif ahci megaraid_sas dm_mirror dm_region_hash
dm_log dm_mod [last unloaded: speedstep_lib]
{code}
The datanode process was still alive and continued to send heartbeats to the namenode, but
it could not respond to any command sent to it, and reading or writing blocks on this datanode
would fail. As a result, requests to HDFS became slower because of the many read/write timeouts.
We are trying to work around this case by adding a dfsadmin command that forcibly marks such an
abnormal datanode as dead until it is restarted. If this case happens again, the command would
keep clients from accessing the faulty datanode.
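As a rough illustration of the idea only (not a patch): a minimal sketch, assuming a hypothetical ForcedDeadRegistry that the namenode would consult in addition to its usual heartbeat check. None of these class or method names exist in HDFS, and the heartbeat expiry interval is just a constructor parameter chosen for the example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch, not HDFS code: tracks datanodes that an admin
 * has forced into the dead state, independent of their heartbeats.
 */
class ForcedDeadRegistry {
    // IDs of datanodes marked dead by the (hypothetical) dfsadmin command.
    private final Set<String> forcedDead = ConcurrentHashMap.newKeySet();
    // Assumed heartbeat expiry interval, in milliseconds.
    private final long heartbeatExpireMs;

    ForcedDeadRegistry(long heartbeatExpireMs) {
        this.heartbeatExpireMs = heartbeatExpireMs;
    }

    /** Called when the admin marks a datanode dead by force. */
    void markDead(String datanodeId) {
        forcedDead.add(datanodeId);
    }

    /** Called when the datanode restarts and re-registers; clears the mark. */
    void onRestart(String datanodeId) {
        forcedDead.remove(datanodeId);
    }

    /**
     * A node counts as dead if it was marked by force, or if its last
     * heartbeat is older than the expiry interval.
     */
    boolean isDead(String datanodeId, long lastHeartbeatMs, long nowMs) {
        return forcedDead.contains(datanodeId)
                || nowMs - lastHeartbeatMs > heartbeatExpireMs;
    }

    public static void main(String[] args) {
        ForcedDeadRegistry registry = new ForcedDeadRegistry(1000L);
        // Heartbeats are fresh, so the node is live until it is marked.
        System.out.println(registry.isDead("xxhdn033", 0L, 500L));
        registry.markDead("xxhdn033");
        System.out.println(registry.isDead("xxhdn033", 0L, 500L));
        registry.onRestart("xxhdn033");
        System.out.println(registry.isDead("xxhdn033", 0L, 500L));
    }
}
```

The point of the sketch is that the forced mark overrides a healthy-looking heartbeat, and that a restart is the only thing that clears it.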
Any thoughts?




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

