Date: Mon, 14 Mar 2016 13:28:33 +0000 (UTC)
From: "David Watzke (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-9955) DataNode won't self-heal after some block dirs were manually misplaced

     [ https://issues.apache.org/jira/browse/HDFS-9955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Watzke updated HDFS-9955:
-------------------------------
    Description:

I accidentally ran this tool on top of a DataNode's datadirs (the datanode was shut down at the time): https://github.com/killerwhile/volume-balancer

The tool makes assumptions about block directory placement that are no longer valid in Hadoop 2.6.0, and it simply moved block directories between the datadirs to balance disk usage. Running it was clearly not a good idea, but my concern is the way the DataNode handled, or rather did not handle, the resulting state. The messages below from the DN log show that the DataNode knew about the misplaced blocks but did nothing to fix them (e.g. self-heal by copying the other replica), which seems like a bug to me. If you need any additional info, please just ask.

{noformat}
2016-03-04 12:40:06,008 WARN org.apache.hadoop.hdfs.server.datanode.VolumeScanner: I/O error while finding block BP-680964103-77.234.46.18-1375882473930:blk_-3159875140074863904_0 on volume /data/18/cdfs/dn
2016-03-04 12:40:06,009 WARN org.apache.hadoop.hdfs.server.datanode.VolumeScanner: I/O error while finding block BP-680964103-77.234.46.18-1375882473930:blk_8369468090548520777_0 on volume /data/18/cdfs/dn
2016-03-04 12:40:06,011 WARN org.apache.hadoop.hdfs.server.datanode.VolumeScanner: I/O error while finding block BP-680964103-77.234.46.18-1375882473930:blk_1226431637_0 on volume /data/18/cdfs/dn
2016-03-04 12:40:06,012 WARN org.apache.hadoop.hdfs.server.datanode.VolumeScanner: I/O error while finding block BP-680964103-77.234.46.18-1375882473930:blk_1169332185_0 on volume /data/18/cdfs/dn
2016-03-04 12:40:06,825 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opReadBlock BP-680964103-77.234.46.18-1375882473930:blk_1226781281_1099829669050 received exception java.io.IOException: BlockId 1226781281 is not valid.
2016-03-04 12:40:06,825 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(5.45.56.30, datanodeUuid=9da950ca-87ae-44ee-9391-0bca669c796b, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=cluster12;nsid=1625487778;c=1438754073236):Got exception while serving BP-680964103-77.234.46.18-1375882473930:blk_1226781281_1099829669050 to /5.45.56.30:48146
java.io.IOException: BlockId 1226781281 is not valid.
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockFile(FsDatasetImpl.java:650)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockFile(FsDatasetImpl.java:641)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getMetaDataInputStream(FsDatasetImpl.java:214)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:282)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:529)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:116)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:243)
        at java.lang.Thread.run(Thread.java:745)
2016-03-04 12:40:06,826 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: prg04-002.ff.avast.com:50010:DataXceiver error processing READ_BLOCK operation src: /5.45.56.30:48146 dst: /5.45.56.30:50010
java.io.IOException: BlockId 1226781281 is not valid.
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockFile(FsDatasetImpl.java:650)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockFile(FsDatasetImpl.java:641)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getMetaDataInputStream(FsDatasetImpl.java:214)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:282)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:529)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:116)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:243)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
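For what it's worth, my understanding (which may be wrong) is that with the block-ID-based layout (HDFS-6482, shipped in 2.6.0) the DataNode computes the directory a finalized replica must live in purely from its block ID, so a block file that has been moved into a different subdir or datadir is never looked for where it now sits. That would explain the "BlockId ... is not valid" errors above. The snippet below is only an illustrative sketch of that mapping, not the actual Hadoop source; if I am reading the code right the real logic lives in DatanodeUtil#idToBlockDir, and the exact shift/mask constants used here are my own assumptions:

{noformat}
import java.io.File;

// Illustrative sketch only: how a 2.6-style DataNode derives the expected
// on-disk location of a finalized replica from its block ID. The shift and
// mask values below are assumed for illustration, not copied from Hadoop.
public class BlockDirSketch {
    static File idToBlockDir(File finalizedDir, long blockId) {
        int d1 = (int) ((blockId >> 16) & 0x1F); // first-level subdir index (assumed mask)
        int d2 = (int) ((blockId >> 8) & 0x1F);  // second-level subdir index (assumed mask)
        return new File(finalizedDir, "subdir" + d1 + File.separator + "subdir" + d2);
    }

    public static void main(String[] args) {
        // Block ID and block pool taken from the log excerpt above.
        long blockId = 1226781281L;
        File finalized = new File(
            "/data/18/cdfs/dn/current/BP-680964103-77.234.46.18-1375882473930/current/finalized");
        // Prints the only directory where a layout like this would look for the
        // block's files; a replica moved anywhere else is simply not found.
        System.out.println(idToBlockDir(finalized, blockId));
    }
}
{noformat}

If that is roughly right, nothing in the read path will ever relocate a misplaced block file; the NameNode can only learn about the missing replica from a block report or a scanner result and then re-replicate from the other copy, which is the self-healing I expected but did not see here.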
> DataNode won't self-heal after some block dirs were manually misplaced
> ----------------------------------------------------------------------
>
>                 Key: HDFS-9955
>                 URL: https://issues.apache.org/jira/browse/HDFS-9955
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0
>         Environment: CentOS 6, Cloudera 5.4.4 (patched Hadoop 2.6.0)
>            Reporter: David Watzke
>              Labels: data-integrity

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)