hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-13111) Close recovery may incorrectly mark blocks corrupt
Date Tue, 06 Feb 2018 16:11:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354076#comment-16354076 ]

Kihwal Lee commented on HDFS-13111:
-----------------------------------

Here are the relevant log lines from the latest example. This node was under heavy I/O load.

{noformat}
2018-02-06 00:00:03,413 [DataXceiver for client DFSClient_XXX at /1.2.3.4:57710] INFO
 Receiving BP-YYY:blk_7654321_1234567 src: /1.2.3.4:57710 dest: /1.2.3.5:1004
2018-02-06 00:09:58,840 [DataXceiver for client DFSClient_XXX at /1.2.3.4:57710] WARN
 Slow BlockReceiver write data to disk cost:462ms (threshold=300ms)
2018-02-06 00:10:40,148 [DataXceiver for client DFSClient_XXX at /1.2.3.4:57710] WARN
 Slow BlockReceiver write data to disk cost:11155ms (threshold=300ms)
2018-02-06 00:10:46,053 [DataXceiver for client DFSClient_XXX at /1.2.3.4:57710] WARN
 Slow BlockReceiver write data to disk cost:1577ms (threshold=300ms)
2018-02-06 00:11:02,376 [DataXceiver for client DFSClient_XXX at /1.2.3.4:57710] WARN
 Slow BlockReceiver write data to disk cost:327ms (threshold=300ms)
2018-02-06 00:11:53,064 [DataXceiver for client DFSClient_XXX at /1.2.3.4:40532] INFO
 Receiving BP-YYY:blk_7654321_1234567 src: /1.2.3.4:40532 dest: /1.2.3.5:1004
2018-02-06 00:12:09,782 [DataXceiver for client DFSClient_XXX at /1.2.3.4:40532] INFO
 Recover failed close BP-YYY:blk_7654321_1234567
2018-02-06 00:12:13,081 [DataXceiver for client DFSClient_XXX at /1.2.3.7:46522] INFO
 Receiving BP-YYY:blk_7654321_1234567 src: /1.2.3.7:46522 dest: /1.2.3.5:1004
2018-02-06 00:12:13,081 [DataXceiver for client DFSClient_XXX at /1.2.3.7:46522] INFO
 Recover failed close BP-YYY:blk_7654321_1234567
2018-02-06 00:12:17,276 [DataXceiver for client DFSClient_XXX at /1.2.3.4:40532] WARN
 Lock held time above threshold: lock identifier: org.apache.hadoop.hdfs.server.datanode
 .fsdataset.impl.FsDatasetImpl lockHeldTimeMs=7492 ms. Suppressed 0 lock warnings.
 The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1556)
 ... // it was recoverClose()
2018-02-06 00:12:17,276 [DataXceiver for client DFSClient_XXX at /1.2.3.7:46522] INFO
 Received BP-YYY:blk_7654321_1135832806836 src: /1.2.3.7:46522 dest: /1.2.3.5:1004 of size xx
2018-02-06 00:12:20,103 [DataXceiver for client DFSClient_XXX at /1.2.3.4:40532] INFO
 Received BP-YYY:blk_7654321_1135832805246 src: /1.2.3.4:40532 dest: /1.2.3.5:1004 of size xx
2018-02-06 00:12:38,353 [PacketResponder: BP-YYY:blk_7654321_1234567, type=LAST_IN_PIPELINE] INFO
 DataNode.clienttrace: src: /1.2.3.4:57710, dest: /1.2.3.5:1004, bytes: 134217728, op: HDFS_WRITE,
 cliID: DFSClient_XXX, offset: 0, srvID: ZZZ, blockid: BP-YYY:blk_7654321_1234567, duration: looong
{noformat}

Note the client port numbers, which identify each writer thread. After two "successful" {{recoverClose()}}
calls, the original writer comes around and also declares success. This must have caused the
reported gen stamp to go backward, even though the replica actually on disk had the latest one.
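
To make the ordering problem concrete, here is a minimal, self-contained sketch (a toy model; the names and the reporting mechanics are illustrative, not the actual DataNode/NameNode code). It only shows why a last-report-wins scheme can move the reported gen stamp backward while the on-disk replica keeps the latest one:

{code:java}
// Toy model, not Hadoop code: every name here is illustrative.
public class GenStampRaceSketch {
    public static void main(String[] args) {
        // Reports arrive in completion order. The two recoverClose() calls
        // bump the gen stamp and report first; the original writer, stalled
        // on slow disks, finishes last and reports its stale stamp.
        long[] reportsInCompletionOrder = {
            1135832805246L, // first recoverClose()
            1135832806836L, // second recoverClose() -- matches the on-disk replica
            1234567L        // original writer finally finishes
        };
        long onDiskGenStamp = 0;    // highest stamp ever written locally
        long lastReportedStamp = 0; // what the NameNode last heard
        for (long gs : reportsInCompletionOrder) {
            onDiskGenStamp = Math.max(onDiskGenStamp, gs);
            lastReportedStamp = gs; // last report wins
        }
        System.out.println("on disk : " + onDiskGenStamp);    // 1135832806836
        System.out.println("reported: " + lastReportedStamp); // 1234567, went backward
        // A reported stamp older than the expected one is what gets the
        // replica marked corrupt until the next full block report (FBR).
    }
}
{code}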

This clearly illustrates that it is wrong to time out waiting for the writer to terminate and
then continue with the recovery.
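
For illustration, a hedged sketch of what "not continuing" could look like (a hypothetical helper, not the real {{FsDatasetImpl}}/{{ReplicaInPipeline}} API): if the old writer thread is still alive after interrupt-and-join, fail the recovery instead of racing the writer.

{code:java}
// Hypothetical sketch; method and parameter names are illustrative.
void stopWriterOrFail(Thread writer, long timeoutMs) throws IOException {
    writer.interrupt();
    try {
        writer.join(timeoutMs);
    } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new IOException("Interrupted waiting for writer to stop", ie);
    }
    if (writer.isAlive()) {
        // Proceeding here is the bug shown in the log above: the stalled
        // writer can still finish later and report a stale gen stamp.
        throw new IOException("Writer did not terminate; aborting close recovery");
    }
    // Only now is it safe to bump the gen stamp and finalize the replica.
}
{code}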

> Close recovery may incorrectly mark blocks corrupt
> --------------------------------------------------
>
>                 Key: HDFS-13111
>                 URL: https://issues.apache.org/jira/browse/HDFS-13111
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>            Priority: Critical
>
> Close recovery can leave a block marked corrupt until the next FBR arrives from one of
> the DNs.  The reason is unclear but has happened multiple times when a DN has I/O-saturated
> disks.



