hadoop-hdfs-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-11056) Concurrent append and read operations lead to checksum error
Date Wed, 09 Nov 2016 17:40:59 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651541#comment-15651541 ]

Hudson commented on HDFS-11056:
-------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10802 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/10802/])
HDFS-11056. Concurrent append and read operations lead to checksum (weichiu: rev c619e9b43fd00ba0e59a98ae09685ff719bb722b)
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeImpl.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImplTestUtils.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestFileAppend.java


> Concurrent append and read operations lead to checksum error
> ------------------------------------------------------------
>
>                 Key: HDFS-11056
>                 URL: https://issues.apache.org/jira/browse/HDFS-11056
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, httpfs
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>         Attachments: HDFS-11056.001.patch, HDFS-11056.002.patch, HDFS-11056.branch-2.patch, HDFS-11056.reproduce.patch
>
>
> If there are two clients, one of which continuously open-append-closes a file while the other continuously open-read-closes the same file, the reader eventually gets a checksum error in the data read.
> On my local Mac it takes a few minutes to produce the error. This happens with httpfs clients, but there's no reason to believe it wouldn't happen with any append client.
> I have a unit test that demonstrates the checksum error. Will attach later. (A minimal sketch of the append/read pattern appears after the quoted log below.)
> Relevant log:
> {quote}
> 2016-10-25 15:34:45,153 INFO  audit - allowed=true	ugi=weichiu (auth:SIMPLE)	ip=/127.0.0.1
cmd=open	src=/tmp/bar.txt	dst=null	perm=null	proto=rpc
> 2016-10-25 15:34:45,155 INFO  DataNode - Receiving BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182
src: /127.0.0.1:51130 dest: /127.0.0.1:50131
> 2016-10-25 15:34:45,155 INFO  FsDatasetImpl - Appending to FinalizedReplica, blk_1073741825_1182,
FINALIZED
>   getNumBytes()     = 182
>   getBytesOnDisk()  = 182
>   getVisibleLength()= 182
>   getVolume()       = /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1
>   getBlockURI()     = file:/Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-837130339-172.16.1.88-1477434851452/current/finalized/subdir0/subdir0/blk_1073741825
> 2016-10-25 15:34:45,167 INFO  DataNode - opReadBlock BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182
received exception java.io.IOException: No data exists for block BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182
> 2016-10-25 15:34:45,167 WARN  DataNode - DatanodeRegistration(127.0.0.1:50131, datanodeUuid=41c96335-5e4b-4950-ac22-3d21b353abb8,
infoPort=50133, infoSecurePort=0, ipcPort=50134, storageInfo=lv=-57;cid=testClusterID;nsid=1472068852;c=1477434851452):Got
exception while serving BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182 to /127.0.0.1:51121
> java.io.IOException: No data exists for block BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockInputStream(FsDatasetImpl.java:773)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:400)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:581)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:150)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:102)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:289)
> 	at java.lang.Thread.run(Thread.java:745)
> 2016-10-25 15:34:45,168 INFO  FSNamesystem - updatePipeline(blk_1073741825_1182, newGS=1183,
newLength=182, newNodes=[127.0.0.1:50131], client=DFSClient_NONMAPREDUCE_-1743096965_197)
> 2016-10-25 15:34:45,168 ERROR DataNode - 127.0.0.1:50131:DataXceiver error processing
READ_BLOCK operation  src: /127.0.0.1:51121 dst: /127.0.0.1:50131
> java.io.IOException: No data exists for block BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockInputStream(FsDatasetImpl.java:773)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:400)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:581)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:150)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:102)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:289)
> 	at java.lang.Thread.run(Thread.java:745)
> 2016-10-25 15:34:45,168 INFO  FSNamesystem - updatePipeline(blk_1073741825_1182 =>
blk_1073741825_1183) success
> 2016-10-25 15:34:45,170 WARN  DFSClient - Found Checksum error for BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182
from DatanodeInfoWithStorage[127.0.0.1:50131,DS-a1878418-4f7f-4fc9-b3f7-d7ed780b5373,DISK]
at 0
> 2016-10-25 15:34:45,170 WARN  DFSClient - No live nodes contain block BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182
after checking nodes = [DatanodeInfoWithStorage[127.0.0.1:50131,DS-a1878418-4f7f-4fc9-b3f7-d7ed780b5373,DISK]],
ignoredNodes = null
> 2016-10-25 15:34:45,170 INFO  DFSClient - Could not obtain BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182
from any node:  No live nodes contain current block Block locations: DatanodeInfoWithStorage[127.0.0.1:50131,DS-a1878418-4f7f-4fc9-b3f7-d7ed780b5373,DISK]
Dead nodes:  DatanodeInfoWithStorage[127.0.0.1:50131,DS-a1878418-4f7f-4fc9-b3f7-d7ed780b5373,DISK].
Will get new block locations from namenode and retry...
> 2016-10-25 15:34:45,170 WARN  DFSClient - DFS chooseDataNode: got # 1 IOException, will
wait for 981.8085941094539 msec.
> 2016-10-25 15:34:45,171 INFO  clienttrace - src: /127.0.0.1:51130, dest: /127.0.0.1:50131,
bytes: 183, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1743096965_197, offset: 0, srvID:
41c96335-5e4b-4950-ac22-3d21b353abb8, blockid: BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1183,
duration: 2175363
> 2016-10-25 15:34:45,171 INFO  DataNode - PacketResponder: BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1183,
type=LAST_IN_PIPELINE terminating
> 2016-10-25 15:34:45,172 INFO  FSNamesystem - BLOCK* blk_1073741825_1183 is COMMITTED
but not COMPLETE(numNodes= 0 <  minimum = 1) in file /tmp/bar.txt
> 2016-10-25 15:34:45,576 INFO  StateChange - DIR* completeFile: /tmp/bar.txt is closed
by DFSClient_NONMAPREDUCE_-1743096965_197
> 2016-10-25 15:34:45,577 INFO  httpfsaudit - [/tmp/bar.txt]
> 2016-10-25 15:34:45,579 INFO  AppendTestUtil - seed=-3144873070946578911, size=1
> 2016-10-25 15:34:45,590 INFO  audit - allowed=true	ugi=weichiu (auth:PROXY) via weichiu
(auth:SIMPLE)	ip=/127.0.0.1	cmd=append	src=/tmp/bar.txt	dst=null	perm=null	proto=rpc
> 2016-10-25 15:34:45,593 INFO  DataNode - Receiving BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1183
src: /127.0.0.1:51132 dest: /127.0.0.1:50131
> 2016-10-25 15:34:45,593 INFO  FsDatasetImpl - Appending to FinalizedReplica, blk_1073741825_1183,
FINALIZED
>   getNumBytes()     = 183
>   getBytesOnDisk()  = 183
>   getVisibleLength()= 183
>   getVolume()       = /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1
>   getBlockURI()     = file:/Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-837130339-172.16.1.88-1477434851452/current/finalized/subdir0/subdir0/blk_1073741825
> 2016-10-25 15:34:45,603 INFO  FSNamesystem - updatePipeline(blk_1073741825_1183, newGS=1184,
newLength=183, newNodes=[127.0.0.1:50131], client=DFSClient_NONMAPREDUCE_-1743096965_197)
> 2016-10-25 15:34:45,603 INFO  FSNamesystem - updatePipeline(blk_1073741825_1183 =>
blk_1073741825_1184) success
> 2016-10-25 15:34:45,605 INFO  clienttrace - src: /127.0.0.1:51132, dest: /127.0.0.1:50131,
bytes: 184, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1743096965_197, offset: 0, srvID:
41c96335-5e4b-4950-ac22-3d21b353abb8, blockid: BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1184,
duration: 1377229
> 2016-10-25 15:34:45,605 INFO  DataNode - PacketResponder: BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1184,
type=LAST_IN_PIPELINE terminating
> 2016-10-25 15:34:45,606 INFO  FSNamesystem - BLOCK* blk_1073741825_1184 is COMMITTED
but not COMPLETE(numNodes= 0 <  minimum = 1) in file /tmp/bar.txt
> 2016-10-25 15:34:46,009 INFO  StateChange - DIR* completeFile: /tmp/bar.txt is closed
by DFSClient_NONMAPREDUCE_-1743096965_197
> 2016-10-25 15:34:46,010 INFO  httpfsaudit - [/tmp/bar.txt]
> 2016-10-25 15:34:46,012 INFO  AppendTestUtil - seed=-263001291976323720, size=1
> 2016-10-25 15:34:46,022 INFO  audit - allowed=true	ugi=weichiu (auth:PROXY) via weichiu
(auth:SIMPLE)	ip=/127.0.0.1	cmd=append	src=/tmp/bar.txt	dst=null	perm=null	proto=rpc
> 2016-10-25 15:34:46,024 INFO  DataNode - Receiving BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1184
src: /127.0.0.1:51133 dest: /127.0.0.1:50131
> 2016-10-25 15:34:46,024 INFO  FsDatasetImpl - Appending to FinalizedReplica, blk_1073741825_1184,
FINALIZED
>   getNumBytes()     = 184
>   getBytesOnDisk()  = 184
>   getVisibleLength()= 184
>   getVolume()       = /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1
>   getBlockURI()     = file:/Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-837130339-172.16.1.88-1477434851452/current/finalized/subdir0/subdir0/blk_1073741825
> 2016-10-25 15:34:46,032 INFO  FSNamesystem - updatePipeline(blk_1073741825_1184, newGS=1185,
newLength=184, newNodes=[127.0.0.1:50131], client=DFSClient_NONMAPREDUCE_-1743096965_197)
> 2016-10-25 15:34:46,032 INFO  FSNamesystem - updatePipeline(blk_1073741825_1184 =>
blk_1073741825_1185) success
> 2016-10-25 15:34:46,033 INFO  clienttrace - src: /127.0.0.1:51133, dest: /127.0.0.1:50131,
bytes: 185, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1743096965_197, offset: 0, srvID:
41c96335-5e4b-4950-ac22-3d21b353abb8, blockid: BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1185,
duration: 1112564
> 2016-10-25 15:34:46,033 INFO  DataNode - PacketResponder: BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1185,
type=LAST_IN_PIPELINE terminating
> 2016-10-25 15:34:46,033 INFO  FSNamesystem - BLOCK* blk_1073741825_1185 is COMMITTED
but not COMPLETE(numNodes= 0 <  minimum = 1) in file /tmp/bar.txt
> 2016-10-25 15:34:46,156 INFO  audit - allowed=true	ugi=weichiu (auth:SIMPLE)	ip=/127.0.0.1
cmd=open	src=/tmp/bar.txt	dst=null	perm=null	proto=rpc
> 2016-10-25 15:34:46,158 INFO  StateChange - *DIR* reportBadBlocks for block: BP-837130339-172.16.1.88-1477434851452:blk_1073741825_1182
on datanode: 127.0.0.1:50131
> Exception in thread "Thread-144" java.lang.RuntimeException: org.apache.hadoop.fs.ChecksumException:
Checksum CRC32C not matched for file /tmp/bar.txt at position 0: expected=C893FEDE but computed=69322F90,
algorithm=PureJavaCrc32C
> 	at org.apache.hadoop.fs.http.client.BaseTestHttpFSWith$1ReaderRunnable.run(BaseTestHttpFSWith.java:309)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.fs.ChecksumException: Checksum CRC32C not matched for file
/tmp/bar.txt at position 0: expected=C893FEDE but computed=69322F90, algorithm=PureJavaCrc32C
> 	at org.apache.hadoop.util.DataChecksum.throwChecksumException(DataChecksum.java:407)
> 	at org.apache.hadoop.util.DataChecksum.verifyChunked(DataChecksum.java:351)
> 	at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:311)
> 	at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.readNextPacket(BlockReaderRemote.java:216)
> 	at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.read(BlockReaderRemote.java:144)
> 	at org.apache.hadoop.hdfs.ByteArrayStrategy.readFromBlock(ReaderStrategy.java:119)
> 	at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:704)
> 	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:765)
> 	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:814)
> 	at java.io.DataInputStream.read(DataInputStream.java:149)
> 	at org.apache.hadoop.fs.http.client.BaseTestHttpFSWith$1ReaderRunnable.run(BaseTestHttpFSWith.java:302)
> 	... 1 more
> 2016-10-25 15:34:46,437 INFO  StateChange - DIR* completeFile: /tmp/bar.txt is closed
by DFSClient_NONMAPREDUCE_-1743096965_197
> 2016-10-25 15:34:46,437 INFO  httpfsaudit - [/tmp/bar.txt]
> 2016-10-25 15:34:46,440 INFO  AppendTestUtil - seed=8756761565208093670, size=1
> 2016-10-25 15:34:46,450 WARN  StateChange - DIR* NameSystem.append: append: lastBlock=blk_1073741825_1185
of src=/tmp/bar.txt is not sufficiently replicated yet.
> 2016-10-25 15:34:46,450 INFO  Server - IPC Server handler 7 on 50130, call Call#25082
Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.append from 127.0.0.1:50147
> java.io.IOException: append: lastBlock=blk_1073741825_1185 of src=/tmp/bar.txt is not
sufficiently replicated yet.
> 	at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:136)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2423)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:773)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:444)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:467)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:990)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:845)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:788)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1795)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2535)
> Exception in thread "Thread-143" java.lang.RuntimeException: java.io.IOException: HTTP
status [500], exception [org.apache.hadoop.ipc.RemoteException], message [append: lastBlock=blk_1073741825_1185
of src=/tmp/bar.txt is not sufficiently replicated yet.] 
> 	at org.apache.hadoop.fs.http.client.BaseTestHttpFSWith$1.run(BaseTestHttpFSWith.java:283)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: HTTP status [500], exception [org.apache.hadoop.ipc.RemoteException],
message [append: lastBlock=blk_1073741825_1185 of src=/tmp/bar.txt is not sufficiently replicated
yet.] 
> 	at org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:159)
> 	at org.apache.hadoop.fs.http.client.HttpFSFileSystem$HttpFSDataOutputStream.close(HttpFSFileSystem.java:470)
> 	at org.apache.hadoop.fs.http.client.BaseTestHttpFSWith$1.run(BaseTestHttpFSWith.java:279)
> 	... 1 more
> org.apache.hadoop.fs.ChecksumException: Checksum CRC32C not matched for file /tmp/bar.txt
at position 0: expected=C893FEDE but computed=69322F90, algorithm=PureJavaCrc32C
> 	at org.apache.hadoop.util.DataChecksum.throwChecksumException(DataChecksum.java:407)
> 	at org.apache.hadoop.util.DataChecksum.verifyChunked(DataChecksum.java:351)
> 	at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:311)
> 	at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.readNextPacket(BlockReaderRemote.java:216)
> 	at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.read(BlockReaderRemote.java:144)
> 	at org.apache.hadoop.hdfs.ByteArrayStrategy.readFromBlock(ReaderStrategy.java:119)
> 	at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:704)
> 	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:765)
> 	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:814)
> 	at java.io.DataInputStream.read(DataInputStream.java:149)
> 	at org.apache.hadoop.fs.http.client.BaseTestHttpFSWith$1ReaderRunnable.run(BaseTestHttpFSWith.java:302)
> 	at java.lang.Thread.run(Thread.java:745)
> {quote}
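
For readers who want to see the shape of the reproduction without the attached patch, here is a minimal, hypothetical sketch of the append/read pattern described above. It is not the attached unit test or HDFS-11056 patch; the class name, loop counts, and the assumption of a running test cluster with append enabled (fs.defaultFS pointing at it) are illustrative only.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical sketch of the access pattern from the report: one thread
 * open-append-closes /tmp/bar.txt in a loop while another open-read-closes it.
 * Not the attached unit test or patch.
 */
public class ConcurrentAppendReadSketch {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes fs.defaultFS points at a test HDFS cluster with append enabled.
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/bar.txt");

    // Seed the file so append() has an existing last block to extend.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write('x');
    }

    Thread appender = new Thread(() -> {
      try {
        for (int i = 0; i < 10000; i++) {
          // open-append-close in a tight loop, as in the report.
          try (FSDataOutputStream out = fs.append(file)) {
            out.write('a');
          }
        }
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    });

    Thread reader = new Thread(() -> {
      byte[] buf = new byte[4096];
      try {
        for (int i = 0; i < 10000; i++) {
          // open-read-close in a tight loop; a ChecksumException surfaces
          // here when the race described in the issue is hit.
          try (FSDataInputStream in = fs.open(file)) {
            while (in.read(buf) != -1) {
              // Discard the data; only the checksum verification matters.
            }
          }
        }
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    });

    appender.start();
    reader.start();
    appender.join();
    reader.join();
    fs.close();
  }
}
{code}

Per the report, on the reporter's machine this kind of interleaving takes a few minutes before the reader trips the CRC32C verification shown in the log above.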



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


