Date: Fri, 17 Feb 2017 07:07:41 +0000 (UTC)
From: "Chen Zhang (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-10927) Lease Recovery: File not getting closed on HDFS when block write operation fails

[ https://issues.apache.org/jira/browse/HDFS-10927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871310#comment-15871310 ]

Chen Zhang commented on HDFS-10927:
-----------------------------------

Hey guys, I just hit an issue like this. An HBase RegionServer got a DiskFull exception while writing WAL files, and the client failed after several retries. When the Master then tried to use recoverLease to recover these files, we got almost the same logs as [~ngoswami] attached:

{quote}
java.io.IOException: File length mismatched. The length of /home/work/ssd11/hdfs/xxxx/datanode/current/BP-228094273-10.136.5.10-1486630815208/current/rbw/blk_1073970099 is 174432256 but r=ReplicaBeingWritten, blk_1073970099_229357, RBW
  getNumBytes()     = 174437376
  getBytesOnDisk()  = 174429752
  getVisibleLength()= 174429752
  getVolume()       = /home/work/ssd11/hdfs/xxxxx/datanode/current
  getBlockFile()    = /home/work/ssd11/hdfs/xxxxx/datanode/current/BP-228094273-10.136.5.10-1486630815208/current/rbw/blk_1073970099
  bytesAcked=174429752
  bytesOnDisk=174429752
{quote}

It is caused by an exception thrown from out.write() in BlockReceiver.receivePacket(). receivePacket() first updates numBytes in the replicaInfo, then writes the data to disk, and only updates bytesOnDisk at the end, so a DiskFull exception thrown during the write leaves the lengths inconsistent.
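To make the ordering concrete, here is a minimal sketch of the three steps and where the failure lands. This is not the actual BlockReceiver code; class and field names are illustrative, and the real logic in receivePacket() is far more involved:

{code:java}
import java.io.IOException;
import java.io.OutputStream;

// NOT the real BlockReceiver: a reduced illustration of the update ordering
// described above.
class PacketWriteOrderingSketch {
    private long numBytes;      // length the replica claims (getNumBytes())
    private long bytesOnDisk;   // length durably written (getBytesOnDisk())
    private final OutputStream out;

    PacketWriteOrderingSketch(OutputStream out) {
        this.out = out;
    }

    void receivePacket(byte[] data, long offsetInBlock) throws IOException {
        // Step 1: numBytes is advanced before anything reaches the disk.
        numBytes = offsetInBlock + data.length;
        // Step 2: the write can fail part-way, e.g. with a disk-full
        // IOException, after the file has already grown by some bytes.
        out.write(data);
        // Step 3: bytesOnDisk is advanced only after a successful write, so
        // this line is never reached when step 2 throws.
        bytesOnDisk = offsetInBlock + data.length;
    }
}
{code}

With this ordering, a write that fails part-way leaves the three lengths in exactly the relationship shown in the log above: getNumBytes()=174437376 > on-disk file length 174432256 > bytesOnDisk=174429752.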

> Lease Recovery: File not getting closed on HDFS when block write operation fails
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-10927
>                 URL: https://issues.apache.org/jira/browse/HDFS-10927
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 2.7.1
>            Reporter: Nitin Goswami
>
> HDFS was unable to close a file when a block write operation failed because disk usage was too high.
> Scenario:
> HBase was writing WAL logs on HDFS while disk usage was very high. During these writes, one of the block write operations failed with the following exception:
> 2016-09-13 10:00:49,978 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1160899
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.194.144:50010 remote=/192.168.192.162:43105]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.BufferedInputStream.fill(Unknown Source)
>         at java.io.BufferedInputStream.read1(Unknown Source)
>         at java.io.BufferedInputStream.read(Unknown Source)
>         at java.io.DataInputStream.read(Unknown Source)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:807)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
>         at java.lang.Thread.run(Unknown Source)
> After this exception, HBase tried to close/roll over the WAL file, but that call also failed and the WAL file could not be closed. After this, HBase shut down the region server.
> After some time, lease recovery was triggered for this file and the following exceptions started occurring:
> 2016-09-13 11:51:11,743 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to obtain replica info for block (=BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1161187) from datanode (=DatanodeInfoWithStorage[192.168.192.162:50010,null,null])
> java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: getBytesOnDisk() < getVisibleLength(), rip=ReplicaBeingWritten, blk_1074859607_1161187, RBW
>   getNumBytes()     = 45524696
>   getBytesOnDisk()  = 45483527
>   getVisibleLength()= 45511557
>   getVolume()       = /opt/reflex/data/yarn/datanode/current
>   getBlockFile()    = /opt/reflex/data/yarn/datanode/current/BP-337226066-192.168.193.217-1468912147102/current/rbw/blk_1074859607
>   bytesAcked=45511557
>   bytesOnDisk=45483527
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2278)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2254)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:2542)
>         at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55)
>         at org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3105)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Unknown Source)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
>         at java.lang.reflect.Constructor.newInstance(Unknown Source)
>         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:2555)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2625)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.access$400(DataNode.java:243)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode$5.run(DataNode.java:2527)
>         at java.lang.Thread.run(Unknown Source)
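For context, the "THIS IS NOT SUPPOSED TO HAPPEN" message comes from a sanity check performed when replica recovery is initialized. A simplified sketch, assuming it mirrors the 2.7.x FsDatasetImpl#initReplicaRecovery logic (parameter names here are illustrative):

{code:java}
import java.io.IOException;

// Simplified sketch of the sanity check in FsDatasetImpl#initReplicaRecovery;
// assumed to mirror the 2.7.x logic, with illustrative parameter names.
class ReplicaRecoverySanityCheck {
    static void check(long bytesOnDisk, long visibleLength, String replica)
            throws IOException {
        // A DataNode must never acknowledge (make visible) more bytes than it
        // has persisted; a replica violating that invariant is rejected.
        if (bytesOnDisk < visibleLength) {
            throw new IOException("THIS IS NOT SUPPOSED TO HAPPEN: "
                    + "getBytesOnDisk() < getVisibleLength(), rip=" + replica);
        }
    }
}
{code}

With the values from the trace above (bytesOnDisk=45483527 < visibleLength=45511557), this check fails on every recovery attempt, so block recovery never succeeds and the lease is never released.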
> Expected Behaviour: Under all conditions, lease recovery should complete and the file should be closed.
> Impact: Since the file could not be closed, HBase went into an inconsistent state because it could not replay the WAL file after the region server restart.
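For reference, lease recovery is normally driven from the client side roughly as sketched below. DistributedFileSystem#recoverLease is the real API and returns true once the NameNode has closed the file; the class name, path variable, and retry bounds here are hypothetical. In the scenario above, this loop would never return true:

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Hypothetical client-side driver; only recoverLease() is the real API.
class LeaseRecoveryDriver {
    static boolean recoverAndWait(DistributedFileSystem dfs, Path walFile)
            throws IOException, InterruptedException {
        for (int attempt = 0; attempt < 10; attempt++) {
            // Returns true once the NameNode has closed the file.
            if (dfs.recoverLease(walFile)) {
                return true;
            }
            Thread.sleep(4000L);  // give block recovery time to complete
        }
        // The state described in this issue: recovery keeps failing on the
        // inconsistent replica, so the file is never closed.
        return false;
    }
}
{code}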