Date: Tue, 8 Nov 2016 21:34:58 +0000 (UTC)
From: "Wei-Chiu Chuang (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-11056) Concurrent append and read operations lead to checksum error

    [ https://issues.apache.org/jira/browse/HDFS-11056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDFS-11056:
-----------------------------------
    Attachment: HDFS-11056.branch-2.patch

Attaching a branch-2 patch for the precommit check.

> Concurrent append and read operations lead to checksum error
> -------------------------------------------------------------
>
>                 Key: HDFS-11056
>                 URL: https://issues.apache.org/jira/browse/HDFS-11056
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, httpfs
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>         Attachments: HDFS-11056.001.patch, HDFS-11056.002.patch, HDFS-11056.branch-2.patch, HDFS-11056.reproduce.patch
>
>
> If there are two clients, one of which open-append-closes a file continuously while the other open-read-closes the same file continuously, the reader eventually gets a checksum error in the data it reads.
> On my local Mac it takes a few minutes to produce the error. This was observed with httpfs clients, but there is no reason to believe it could not happen with any append client.
> I have a unit test that demonstrates the checksum error. Will attach later.
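>
> In essence the scenario is just two loops racing on the same file. Below is a minimal, illustrative sketch against the generic FileSystem API; it is not the attached unit test, it assumes fs.defaultFS points at a running HDFS cluster, and the class name AppendReadRace is made up for the example:
> {code:java}
> import java.util.concurrent.atomic.AtomicBoolean;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class AppendReadRace {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     FileSystem fs = FileSystem.get(conf);   // expects an HDFS fs.defaultFS
>     Path file = new Path("/tmp/bar.txt");
>     AtomicBoolean done = new AtomicBoolean(false);
>
>     // Seed the file so the appender has something to extend.
>     try (FSDataOutputStream out = fs.create(file, true)) {
>       out.write('x');
>     }
>
>     // Writer: open-append-close in a tight loop.
>     Thread appender = new Thread(() -> {
>       try {
>         while (!done.get()) {
>           try (FSDataOutputStream out = fs.append(file)) {
>             out.write('x');
>           }
>         }
>       } catch (Exception e) {
>         done.set(true);
>         e.printStackTrace();
>       }
>     });
>
>     // Reader: open-read-close in a tight loop; checksums are verified as part
>     // of the read, so a mismatch surfaces as o.a.h.fs.ChecksumException.
>     Thread reader = new Thread(() -> {
>       byte[] buf = new byte[4096];
>       try {
>         while (!done.get()) {
>           try (FSDataInputStream in = fs.open(file)) {
>             while (in.read(buf) > 0) {
>               // discard the data; we only care whether the read succeeds
>             }
>           }
>         }
>       } catch (Exception e) {
>         done.set(true);
>         e.printStackTrace();   // eventually a ChecksumException at position 0
>       }
>     });
>
>     appender.start();
>     reader.start();
>     appender.join();
>     reader.join();
>     fs.close();
>   }
> }
> {code}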
> Relevant log:
> {quote}
> 2016-10-25 15:34:45,153 INFO audit - allowed=true ugi=weichiu (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/tmp/bar.txt dst=null perm=null proto=rpc
> 2016-10-25 15:34:45,155 INFO DataNode - Receiving BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182 src: /127.0.0.1:51130 dest: /127.0.0.1:50131
> 2016-10-25 15:34:45,155 INFO FsDatasetImpl - Appending to FinalizedReplica, blk_1073741825_1182, FINALIZED
>   getNumBytes()     = 182
>   getBytesOnDisk()  = 182
>   getVisibleLength()= 182
>   getVolume()       = /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1
>   getBlockURI()     = file:/Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-837130339-172.16.1.88-1477434851452/current/finalized/subdir0/subdir0/blk_1073741825
> 2016-10-25 15:34:45,167 INFO DataNode - opReadBlock BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182 received exception java.io.IOException: No data exists for block BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182
> 2016-10-25 15:34:45,167 WARN DataNode - DatanodeRegistration(127.0.0.1:50131, datanodeUuid=41c96335-5e4b-4950-ac22-3d21b353abb8, infoPort=50133, infoSecurePort=0, ipcPort=50134, storageInfo=lv=-57;cid=testClusterID;nsid=1472068852;c=1477434851452):Got exception while serving BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182 to /127.0.0.1:51121
> java.io.IOException: No data exists for block BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockInputStream(FsDatasetImpl.java:773)
>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:400)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:581)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:150)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:102)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:289)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-10-25 15:34:45,168 INFO FSNamesystem - updatePipeline(blk_1073741825_1182, newGS=1183, newLength=182, newNodes=[127.0.0.1:50131], client=DFSClient_NONMAPREDUCE_-1743096965_197)
> 2016-10-25 15:34:45,168 ERROR DataNode - 127.0.0.1:50131:DataXceiver error processing READ_BLOCK operation src: /127.0.0.1:51121 dst: /127.0.0.1:50131
> java.io.IOException: No data exists for block BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockInputStream(FsDatasetImpl.java:773)
>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:400)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:581)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:150)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:102)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:289)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-10-25 15:34:45,168 INFO FSNamesystem - updatePipeline(blk_1073741825_1182 => blk_1073741825_1183) success
> 2016-10-25 15:34:45,170 WARN DFSClient - Found Checksum error for BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182 from DatanodeInfoWithStorage[127.0.0.1:50131,DS-a1878418-4f7f-4fc9-b3f7-d7ed780b5373,DISK] at 0
> 2016-10-25 15:34:45,170 WARN DFSClient - No live nodes contain block BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182 after checking nodes = [DatanodeInfoWithStorage[127.0.0.1:50131,DS-a1878418-4f7f-4fc9-b3f7-d7ed780b5373,DISK]], ignoredNodes = null
> 2016-10-25 15:34:45,170 INFO DFSClient - Could not obtain BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182 from any node: No live nodes contain current block Block locations: DatanodeInfoWithStorage[127.0.0.1:50131,DS-a1878418-4f7f-4fc9-b3f7-d7ed780b5373,DISK] Dead nodes: DatanodeInfoWithStorage[127.0.0.1:50131,DS-a1878418-4f7f-4fc9-b3f7-d7ed780b5373,DISK]. Will get new block locations from namenode and retry...
> 2016-10-25 15:34:45,170 WARN DFSClient - DFS chooseDataNode: got # 1 IOException, will wait for 981.8085941094539 msec.
> 2016-10-25 15:34:45,171 INFO clienttrace - src: /127.0.0.1:51130, dest: /127.0.0.1:50131, bytes: 183, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1743096965_197, offset: 0, srvID: 41c96335-5e4b-4950-ac22-3d21b353abb8, blockid: BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1183, duration: 2175363
> 2016-10-25 15:34:45,171 INFO DataNode - PacketResponder: BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1183, type=LAST_IN_PIPELINE terminating
> 2016-10-25 15:34:45,172 INFO FSNamesystem - BLOCK* blk_1073741825_1183 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /tmp/bar.txt
> 2016-10-25 15:34:45,576 INFO StateChange - DIR* completeFile: /tmp/bar.txt is closed by DFSClient_NONMAPREDUCE_-1743096965_197
> 2016-10-25 15:34:45,577 INFO httpfsaudit - [/tmp/bar.txt]
> 2016-10-25 15:34:45,579 INFO AppendTestUtil - seed=-3144873070946578911, size=1
> 2016-10-25 15:34:45,590 INFO audit - allowed=true ugi=weichiu (auth:PROXY) via weichiu (auth:SIMPLE) ip=/127.0.0.1 cmd=append src=/tmp/bar.txt dst=null perm=null proto=rpc
> 2016-10-25 15:34:45,593 INFO DataNode - Receiving BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1183 src: /127.0.0.1:51132 dest: /127.0.0.1:50131
> 2016-10-25 15:34:45,593 INFO FsDatasetImpl - Appending to FinalizedReplica, blk_1073741825_1183, FINALIZED
>   getNumBytes()     = 183
>   getBytesOnDisk()  = 183
>   getVisibleLength()= 183
>   getVolume()       = /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1
>   getBlockURI()     = file:/Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-837130339-172.16.1.88-1477434851452/current/finalized/subdir0/subdir0/blk_1073741825
> 2016-10-25 15:34:45,603 INFO FSNamesystem - updatePipeline(blk_1073741825_1183, newGS=1184, newLength=183, newNodes=[127.0.0.1:50131], client=DFSClient_NONMAPREDUCE_-1743096965_197)
> 2016-10-25 15:34:45,603 INFO FSNamesystem - updatePipeline(blk_1073741825_1183 => blk_1073741825_1184) success
> 2016-10-25 15:34:45,605 INFO clienttrace - src: /127.0.0.1:51132, dest: /127.0.0.1:50131, bytes: 184, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1743096965_197, offset: 0, srvID: 41c96335-5e4b-4950-ac22-3d21b353abb8, blockid: BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1184, duration: 1377229
> 2016-10-25 15:34:45,605 INFO DataNode - PacketResponder: BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1184, type=LAST_IN_PIPELINE terminating
> 2016-10-25 15:34:45,606 INFO FSNamesystem - BLOCK* blk_1073741825_1184 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /tmp/bar.txt
> 2016-10-25 15:34:46,009 INFO StateChange - DIR* completeFile: /tmp/bar.txt is closed by DFSClient_NONMAPREDUCE_-1743096965_197
> 2016-10-25 15:34:46,010 INFO httpfsaudit - [/tmp/bar.txt]
> 2016-10-25 15:34:46,012 INFO AppendTestUtil - seed=-263001291976323720, size=1
> 2016-10-25 15:34:46,022 INFO audit - allowed=true ugi=weichiu (auth:PROXY) via weichiu (auth:SIMPLE) ip=/127.0.0.1 cmd=append src=/tmp/bar.txt dst=null perm=null proto=rpc
> 2016-10-25 15:34:46,024 INFO DataNode - Receiving BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1184 src: /127.0.0.1:51133 dest: /127.0.0.1:50131
> 2016-10-25 15:34:46,024 INFO FsDatasetImpl - Appending to FinalizedReplica, blk_1073741825_1184, FINALIZED
>   getNumBytes()     = 184
>   getBytesOnDisk()  = 184
>   getVisibleLength()= 184
>   getVolume()       = /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1
>   getBlockURI()     = file:/Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-837130339-172.16.1.88-1477434851452/current/finalized/subdir0/subdir0/blk_1073741825
> 2016-10-25 15:34:46,032 INFO FSNamesystem - updatePipeline(blk_1073741825_1184, newGS=1185, newLength=184, newNodes=[127.0.0.1:50131], client=DFSClient_NONMAPREDUCE_-1743096965_197)
> 2016-10-25 15:34:46,032 INFO FSNamesystem - updatePipeline(blk_1073741825_1184 => blk_1073741825_1185) success
> 2016-10-25 15:34:46,033 INFO clienttrace - src: /127.0.0.1:51133, dest: /127.0.0.1:50131, bytes: 185, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1743096965_197, offset: 0, srvID: 41c96335-5e4b-4950-ac22-3d21b353abb8, blockid: BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1185, duration: 1112564
> 2016-10-25 15:34:46,033 INFO DataNode - PacketResponder: BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1185, type=LAST_IN_PIPELINE terminating
> 2016-10-25 15:34:46,033 INFO FSNamesystem - BLOCK* blk_1073741825_1185 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /tmp/bar.txt
> 2016-10-25 15:34:46,156 INFO audit - allowed=true ugi=weichiu (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/tmp/bar.txt dst=null perm=null proto=rpc
> 2016-10-25 15:34:46,158 INFO StateChange - *DIR* reportBadBlocks for block: BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182 on datanode: 127.0.0.1:50131
> Exception in thread "Thread-144" java.lang.RuntimeException: org.apache.hadoop.fs.ChecksumException: Checksum CRC32C not matched for file /tmp/bar.txt at position 0: expected=C893FEDE but computed=69322F90, algorithm=PureJavaCrc32C
>         at org.apache.hadoop.fs.http.client.BaseTestHttpFSWith$1ReaderRunnable.run(BaseTestHttpFSWith.java:309)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.fs.ChecksumException: Checksum CRC32C not matched for file /tmp/bar.txt at position 0: expected=C893FEDE but computed=69322F90, algorithm=PureJavaCrc32C
>         at org.apache.hadoop.util.DataChecksum.throwChecksumException(DataChecksum.java:407)
>         at org.apache.hadoop.util.DataChecksum.verifyChunked(DataChecksum.java:351)
>         at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:311)
>         at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.readNextPacket(BlockReaderRemote.java:216)
>         at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.read(BlockReaderRemote.java:144)
>         at org.apache.hadoop.hdfs.ByteArrayStrategy.readFromBlock(ReaderStrategy.java:119)
>         at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:704)
>         at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:765)
>         at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:814)
>         at java.io.DataInputStream.read(DataInputStream.java:149)
>         at org.apache.hadoop.fs.http.client.BaseTestHttpFSWith$1ReaderRunnable.run(BaseTestHttpFSWith.java:302)
>         ... 1 more
> 2016-10-25 15:34:46,437 INFO StateChange - DIR* completeFile: /tmp/bar.txt is closed by DFSClient_NONMAPREDUCE_-1743096965_197
> 2016-10-25 15:34:46,437 INFO httpfsaudit - [/tmp/bar.txt]
> 2016-10-25 15:34:46,440 INFO AppendTestUtil - seed=8756761565208093670, size=1
> 2016-10-25 15:34:46,450 WARN StateChange - DIR* NameSystem.append: append: lastBlock=blk_1073741825_1185 of src=/tmp/bar.txt is not sufficiently replicated yet.
> 2016-10-25 15:34:46,450 INFO Server - IPC Server handler 7 on 50130, call Call#25082 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.append from 127.0.0.1:50147
> java.io.IOException: append: lastBlock=blk_1073741825_1185 of src=/tmp/bar.txt is not sufficiently replicated yet.
>         at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:136)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2423)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:773)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:444)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:467)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:990)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1795)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2535)
> Exception in thread "Thread-143" java.lang.RuntimeException: java.io.IOException: HTTP status [500], exception [org.apache.hadoop.ipc.RemoteException], message [append: lastBlock=blk_1073741825_1185 of src=/tmp/bar.txt is not sufficiently replicated yet.]
>         at org.apache.hadoop.fs.http.client.BaseTestHttpFSWith$1.run(BaseTestHttpFSWith.java:283)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: HTTP status [500], exception [org.apache.hadoop.ipc.RemoteException], message [append: lastBlock=blk_1073741825_1185 of src=/tmp/bar.txt is not sufficiently replicated yet.]
>         at org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:159)
>         at org.apache.hadoop.fs.http.client.HttpFSFileSystem$HttpFSDataOutputStream.close(HttpFSFileSystem.java:470)
>         at org.apache.hadoop.fs.http.client.BaseTestHttpFSWith$1.run(BaseTestHttpFSWith.java:279)
>         ... 1 more
> org.apache.hadoop.fs.ChecksumException: Checksum CRC32C not matched for file /tmp/bar.txt at position 0: expected=C893FEDE but computed=69322F90, algorithm=PureJavaCrc32C
>         at org.apache.hadoop.util.DataChecksum.throwChecksumException(DataChecksum.java:407)
>         at org.apache.hadoop.util.DataChecksum.verifyChunked(DataChecksum.java:351)
>         at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:311)
>         at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.readNextPacket(BlockReaderRemote.java:216)
>         at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.read(BlockReaderRemote.java:144)
>         at org.apache.hadoop.hdfs.ByteArrayStrategy.readFromBlock(ReaderStrategy.java:119)
>         at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:704)
>         at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:765)
>         at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:814)
>         at java.io.DataInputStream.read(DataInputStream.java:149)
>         at org.apache.hadoop.fs.http.client.BaseTestHttpFSWith$1ReaderRunnable.run(BaseTestHttpFSWith.java:302)
>         at java.lang.Thread.run(Thread.java:745)
> {quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)