hadoop-hdfs-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-13103) HDFS Client write acknowledgement timeout should not depend on read timeout
Date Fri, 02 Feb 2018 21:15:00 GMT

     [ https://issues.apache.org/jira/browse/HDFS-13103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDFS-13103:
-----------------------------------
    Description: 
HDFS-8311 added a timeout on client write acknowledgements for both:
 # transferring blocks
 # writing to a DataNode.

The timeout shares its configuration key with the client read timeout ({{dfs.client.socket-timeout}}).
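A minimal sketch of the coupling (the 60-second default and all names here are illustrative assumptions, not the actual DFS client code):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class SharedTimeoutSketch {
  // Assumed 60s default; illustrative only.
  static final int DEFAULT_SOCKET_TIMEOUT_MS = 60 * 1000;

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Both paths resolve their timeout from the same key,
    // so tuning one inevitably tunes the other.
    int socketTimeoutMs =
        conf.getInt("dfs.client.socket-timeout", DEFAULT_SOCKET_TIMEOUT_MS);
    int readTimeoutMs = socketTimeoutMs;     // reading blocks from a DataNode
    int writeAckTimeoutMs = socketTimeoutMs; // waiting for pipeline acks (HDFS-8311)
    System.out.println("read=" + readTimeoutMs
        + "ms, writeAck=" + writeAckTimeoutMs + "ms");
  }
}
{code}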

While I agree that having a timeout is good, *it does not make sense for the write acknowledgement timeout to depend on the read timeout*. We saw a case where a cluster admin wanted to reduce the HBase RegionServer read timeout so as to detect DataNode crashes quickly, but did not realize that doing so also shortens the write acknowledgement timeout.
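For example (values hypothetical, continuing the {{conf}} from the sketch above), the tuning looks harmless because the admin only has the read path in mind:

{code:java}
// Intended: fail reads from a dead DataNode after 3s instead of 60s.
// Side effect: the write-ack wait shrinks to 3s as well, because the
// ack path resolves the same key.
conf.setInt("dfs.client.socket-timeout", 3000);
{code}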

In the end, the effective client write acknowledgement timeout becomes shorter than the effective DataNode write timeout. If the last two DataNodes in the write pipeline crash, the client would think the first DataNode is faulty (that DN appears unresponsive because it is still waiting for the ack from the downstream DNs), drop it, and the HBase RS would then crash because it is unable to write to any good DataNode. This scenario is possible during a rack failure.
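A rough timeline under assumed values (the 60s DataNode-side wait is illustrative; the point is only that it is much longer than the client's lowered 3s ack timeout):

{code:java}
// Pipeline: client -> DN1 -> DN2 -> DN3, where DN2 and DN3 crash.
int clientAckTimeoutMs = 3_000;   // from dfs.client.socket-timeout above
int dnDownstreamWaitMs = 60_000;  // DN1's own, untouched downstream timeout

// t=0s  DN1 forwards the packet and waits for DN2's ack.
// t=3s  the client's ack timeout fires first (3s < 60s), so the client
//       blames the healthy DN1 and drops it from the pipeline.
// With no healthy DataNodes left to write to (e.g. a rack failure),
// the RegionServer's write fails and the RS aborts.
assert clientAckTimeoutMs < dnDownstreamWaitMs;
{code}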

This problem is even worse on a Cloudera Manager-managed cluster. By default, a CM-managed HBase RegionServer sets {{dfs.client.block.write.replace-datanode-on-failure.enable = true}}, so even one unresponsive DataNode could crash the HBase RegionServer.
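For reference, the relevant client-side keys look like this (continuing the sketch above; values are illustrative of the defaults described here, not verified CM settings):

{code:java}
conf.setBoolean(
    "dfs.client.block.write.replace-datanode-on-failure.enable", true);
// With best-effort left false, a failed attempt to replace a dropped
// DataNode surfaces as an exception to the writer, which is how a single
// dropped DN can abort the RegionServer.
conf.setBoolean(
    "dfs.client.block.write.replace-datanode-on-failure.best-effort", false);
{code}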

I am raising this Jira to discuss two possible solutions:
 # add a new config for the write acknowledgement timeout, so it no longer depends on the read timeout (see the sketch after this list); or
 # update the description of {{dfs.client.socket-timeout}} in hdfs-default.xml so that admins are aware that the write acknowledgement timeout depends on this configuration.
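A minimal sketch of option 1, assuming a hypothetical key name ({{dfs.client.write.ack.timeout}} does not exist today):

{code:java}
// Hypothetical new key: falls back to dfs.client.socket-timeout for
// backward compatibility, but lets admins decouple the two timeouts.
int socketTimeoutMs = conf.getInt("dfs.client.socket-timeout", 60 * 1000);
int writeAckTimeoutMs =
    conf.getInt("dfs.client.write.ack.timeout", socketTimeoutMs);
{code}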



> HDFS Client write acknowledgement timeout should not depend on read timeout
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-13103
>                 URL: https://issues.apache.org/jira/browse/HDFS-13103
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.8.0, 3.0.0-alpha1
>         Environment: CDH5.7.0 and above + Cloudera Manager. HBase RegionServer.
>            Reporter: Wei-Chiu Chuang
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

