hadoop-hdfs-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case
Date Tue, 18 Jul 2017 16:12:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091766#comment-16091766 ]

Wei-Chiu Chuang commented on HDFS-12136:
----------------------------------------

Hi [~daryn], sorry to come to this late. Thanks for the patch, and thanks [~kihwal] for the stack
trace. No doubt this is a serious performance regression.

I want to emphasize that the initial (false positive) corruption is due to a race condition
between concurrent readers and writers. While HDFS-11160 made it seem like this only happens
to VolumeScanner, it can happen to any reader. When a reader thinks it has hit a checksum
corruption, it reports it to the NN, which removes the block replica. This happens very
frequently for a Spark Streaming application, where data is read in real time while it is
being ingested.

If you want to go this route, please add the same check on the dfsclient reader side. For example,
when it receives a checksum error, it should read again to weed out false positives caused by the
race condition.
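
To make the suggestion concrete, here is a minimal sketch of "read again before reporting" in plain Java. It is not DFSClient code: {{ChecksumRetrySketch}}, {{ChunkRead}}, {{ChunkSource}}, and {{verifyWithRetry}} are hypothetical names made up for illustration, and CRC32 only stands in for whatever checksum the replica actually uses.

```java
import java.util.zip.CRC32;

/** Hypothetical sketch; none of these names are HDFS APIs. */
final class ChecksumRetrySketch {

    /** One read of a chunk: the data bytes plus the checksum stored alongside them. */
    static final class ChunkRead {
        final byte[] data;
        final long storedChecksum;

        ChunkRead(byte[] data, long storedChecksum) {
            this.data = data;
            this.storedChecksum = storedChecksum;
        }
    }

    /** Hypothetical source of chunk reads, e.g. a replica that a writer is still appending to. */
    interface ChunkSource {
        ChunkRead read();
    }

    /** CRC32 is used here only as a stand-in checksum. */
    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    /**
     * Re-reads on a checksum mismatch instead of reporting corruption immediately.
     * Returns true as soon as one read verifies, and false only if all
     * maxAttempts reads mismatch, which is the point where a caller would
     * consider reporting the replica.
     */
    static boolean verifyWithRetry(ChunkSource source, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            ChunkRead r = source.read();
            if (crc(r.data) == r.storedChecksum) {
                return true; // an earlier mismatch, if any, was transient
            }
        }
        return false;
    }
}
```

A mismatch caused by reading data and checksum at slightly different moments should clear on a later attempt, while real corruption keeps failing. How many attempts to allow, and whether to wait between them, is left open here.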

> BlockSender performance regression due to volume scanner edge case
> ------------------------------------------------------------------
>
>                 Key: HDFS-12136
>                 URL: https://issues.apache.org/jira/browse/HDFS-12136
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan by reading
> the last checksum of finalized blocks within the {{BlockSender}} ctor.  Unfortunately it's
> holding the exclusive dataset lock to open and read the metafile multiple times.  Block sender
> instantiation becomes serialized.
> Performance completely collapses under heavy disk I/O utilization or high xceiver activity,
> e.g. lost node replication, balancing, or decommissioning.  The xceiver threads congest creating
> block senders and impair the heartbeat processing that is contending for the same lock.  Combined
> with other lock contention issues, pipelines break and nodes sporadically go dead.
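
The serialization described above comes from doing slow I/O while an exclusive lock is held. The sketch below contrasts that with snapshotting in-memory state under the lock and doing the read after releasing it. It is not taken from the attached patches, and it is not datanode code: {{LockScopeSketch}}, {{visibleLength}}, and the method names are hypothetical, and a plain {{ReentrantLock}} stands in for the dataset lock.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.LongFunction;
import java.util.function.Supplier;

/** Hypothetical sketch of lock scope around slow reads; not HDFS code. */
final class LockScopeSketch {
    private final ReentrantLock datasetLock = new ReentrantLock(); // stand-in for the dataset lock
    private long visibleLength = 0;                                // stand-in for in-memory replica state

    void setVisibleLength(long len) {
        datasetLock.lock();
        try {
            visibleLength = len;
        } finally {
            datasetLock.unlock();
        }
    }

    /** Pattern from the report: the slow read runs while the lock is held, so callers serialize. */
    <T> T readUnderLock(Supplier<T> slowRead) {
        datasetLock.lock();
        try {
            return slowRead.get();
        } finally {
            datasetLock.unlock();
        }
    }

    /** Alternative: copy the needed state under the lock, then do the slow read without it. */
    <T> T readOutsideLock(LongFunction<T> slowRead) {
        long snapshot;
        datasetLock.lock();
        try {
            snapshot = visibleLength;
        } finally {
            datasetLock.unlock();
        }
        return slowRead.apply(snapshot);
    }

    /** Lets a callback check whether the current thread holds the lock. */
    boolean lockHeldByCaller() {
        return datasetLock.isHeldByCurrentThread();
    }
}
```

The second form shortens the critical section to a field copy, at the cost that the state may change between the snapshot and the read, so the reader has to tolerate or re-validate a stale snapshot.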



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

