hadoop-hdfs-issues mailing list archives

From "Daryn Sharp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12645) FSDatasetImpl lock will stall BP service actors and may cause missing blocks
Date Thu, 12 Oct 2017 14:57:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202063#comment-16202063
] 

Daryn Sharp commented on HDFS-12645:
------------------------------------

I understand the lifeline protocol was designed to avoid the node being declared dead, but
it's just hiding the consequences of a poor locking design.  Preventing a dead node via a
lifeline is of dubious value when the node is effectively dead because IO is blocked while
holding the dataset lock.  The node can't process replications, which may lead to data loss
when another node could have serviced the replication request.  For example, the lifeline
will keep a node "alive" even though it's having severe hardware issues and will ultimately
crash.

> FSDatasetImpl lock will stall BP service actors and may cause missing blocks
> ----------------------------------------------------------------------------
>
>                 Key: HDFS-12645
>                 URL: https://issues.apache.org/jira/browse/HDFS-12645
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>
> The DN is extremely susceptible to a slow volume due to bad locking practices.  DN operations
> require a fs dataset lock.  IO in the dataset lock should not be permissible as it leads to
> severe performance degradation and possibly (temporarily) missing blocks.
> A slow disk will cause pipelines to experience significant latency and timeouts, increasing
> lock/IO contention while cleaning up, leading to more timeouts, etc.  Meanwhile, the actor
> service thread is interleaving multiple lock acquires/releases with xceivers.  If many commands
> are issued, the node may be incorrectly declared dead.
> HDFS-12639 documents that both actors synchronize on the offer service lock while processing
> commands.  A backlogged active actor will block the standby actor and cause it to go dead
> too.
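
The point about not doing IO under the dataset lock can be illustrated with a
minimal sketch.  This is not the actual FsDatasetImpl code: the class
DatasetLockSketch, its replica map, and its method names are all hypothetical.
It only shows the general pattern of resolving in-memory metadata under the lock
and running the (potentially slow) disk operation after the lock is released.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.ToLongFunction;

/** Hypothetical sketch only; this is NOT the real FsDatasetImpl. */
class DatasetLockSketch {
    // Stand-in for the dataset lock described in the issue.
    private final ReentrantLock datasetLock = new ReentrantLock();
    // Hypothetical in-memory replica map: block id -> on-disk path.
    private final Map<Long, String> replicaPaths = new HashMap<>();

    /** Metadata-only update: short critical section, no disk access. */
    void addReplica(long blockId, String path) {
        datasetLock.lock();
        try {
            replicaPaths.put(blockId, path);
        } finally {
            datasetLock.unlock();
        }
    }

    /**
     * Looks up the replica path under the lock, then runs the disk
     * operation (passed in as diskOp) only after the lock is released,
     * so a slow volume cannot stall other threads waiting on the lock.
     */
    long withReplica(long blockId, ToLongFunction<String> diskOp) {
        String path;
        datasetLock.lock();
        try {
            path = replicaPaths.get(blockId); // in-memory lookup only
        } finally {
            datasetLock.unlock();
        }
        if (path == null) {
            throw new IllegalArgumentException("unknown block " + blockId);
        }
        return diskOp.applyAsLong(path); // slow IO happens outside the lock
    }

    /** Exposed so callers can check whether they are inside the lock. */
    boolean lockHeldByCurrentThread() {
        return datasetLock.isHeldByCurrentThread();
    }
}
```

A real implementation would also have to handle the replica changing or being
removed between the lookup and the IO, which this sketch ignores.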



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

