hadoop-hdfs-issues mailing list archives

From "Daryn Sharp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case
Date Fri, 21 Jul 2017 15:56:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16096405#comment-16096405 ]

Daryn Sharp commented on HDFS-12136:
------------------------------------

[~jojochuang], do you have a unit test that can reliably expose the issue?  We run lots of
Spark apps and didn't see the problem in 2.7, or maybe it happens so infrequently that nobody
notices.

The basic problem is that doing I/O while holding the lock is completely unacceptable.  Ordinary events like a failing drive, a heavily utilized drive, or high replication from decommissioning or a lost node will jam up the DN.  Just one slow drive will ruin the DN.
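
For illustration only, a minimal sketch of the anti-pattern versus the general remedy of keeping disk reads outside the shared lock.  The class and method names are hypothetical, not the actual {{BlockSender}}/{{FsDatasetImpl}} code:

{code:java}
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch only -- not HDFS code. It contrasts doing a disk read
// while holding a coarse shared lock with resolving state under the lock and
// performing the slow read afterwards.
public class DatasetLockSketch {
  private final Object datasetLock = new Object();

  // Anti-pattern: every caller serializes on the lock for the duration of a disk read.
  byte[] lastChecksumUnderLock(Path metaFile) throws Exception {
    synchronized (datasetLock) {
      return Files.readAllBytes(metaFile);   // slow disk I/O inside the lock
    }
  }

  // Preferred shape: only cheap, in-memory work under the lock; I/O happens outside it.
  byte[] lastChecksumOutsideLock(Path metaFile) throws Exception {
    Path resolved;
    synchronized (datasetLock) {
      resolved = metaFile;                   // stand-in for resolving replica/meta state
    }
    return Files.readAllBytes(resolved);     // slow disk I/O outside the lock
  }
}
{code}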

Here's how that becomes catastrophic: under sufficiently high load, DNs congest with BlockSenders
serially computing checksums.  Start decommissioning, and the heartbeat thread receives replication
commands.  Newly instantiated BlockSenders contend with the backlogged xceiver threads, so heartbeats
are now delayed.  Meanwhile, clients time out and pipelines collapse after 45s.  Clients reconstruct
the pipeline, but the prior xceivers may still be blocked creating the BlockSender, not knowing
the client disconnected.  The number of blocked threads climbs, eventually hitting the xceiver
thread limit (4k for us).  Surprisingly, the heartbeat thread can eventually receive so little
share of the lock that 10 minutes elapse and the node goes "dead".  Now the replication monitor
issues even more replications, causing even more congestion on other nodes.  In a real incident,
we had ~6 random nodes flapping for hours, causing sporadic missing blocks and jobs taking
forever.
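
As a toy demonstration of that starvation pattern (not HDFS code; all names are made up), many worker threads holding a shared lock for slow work push a periodic heartbeat-style thread's lock wait far past its interval:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy sketch: "xceiver"-like workers each hold a shared lock while doing slow
// work, so a periodic "heartbeat"-like thread needing the same lock falls behind.
public class LockStarvationDemo {
  private static final Object lock = new Object();

  public static void main(String[] args) throws InterruptedException {
    ExecutorService workers = Executors.newFixedThreadPool(32);
    for (int i = 0; i < 32; i++) {
      workers.submit(() -> {
        while (!Thread.currentThread().isInterrupted()) {
          synchronized (lock) {
            sleepQuietly(100);   // stands in for a disk read done under the lock
          }
        }
      });
    }

    // The "heartbeat" thread wants the lock briefly every second,
    // but must queue behind the workers' long critical sections.
    for (int beat = 0; beat < 5; beat++) {
      long start = System.nanoTime();
      synchronized (lock) {
        // trivially short critical section
      }
      long waitedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
      System.out.println("heartbeat " + beat + " waited " + waitedMs + " ms for the lock");
      Thread.sleep(1000);
    }
    workers.shutdownNow();
  }

  private static void sleepQuietly(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}
{code}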

Kihwal and I have a few approaches in mind to further fix the original issue.  However,
at this time I'd argue that undoing the cluster-damaging performance regression is more
important than completely fixing the reader/append race.



> BlockSender performance regression due to volume scanner edge case
> ------------------------------------------------------------------
>
>                 Key: HDFS-12136
>                 URL: https://issues.apache.org/jira/browse/HDFS-12136
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan by reading
the last checksum of finalized blocks within the {{BlockSender}} ctor.  Unfortunately, it
holds the exclusive dataset lock while opening and reading the metafile multiple times, so
BlockSender instantiation becomes serialized.
> Performance completely collapses under heavy disk I/O utilization or high xceiver activity,
e.g. lost-node replication, balancing, or decommissioning.  The xceiver threads congest while
creating block senders and impair the heartbeat processing that contends for the same lock.
Combined with other lock contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
