Subject: Data node check dir storm
From: Vitalii Tymchyshyn <tivv00@gmail.com>
To: hdfs-user@hadoop.apache.org
Date: Mon, 20 Jun 2011 13:50:09 +0300

Hello.

I am using Hadoop 0.21. I can see that if a data node receives an IO error, this can cause a checkDir storm.
What I mean:

1) Any IO error produces a DataNode.checkDiskError call.

2) This call takes the FSVolumeSet lock while it walks the whole directory tree:

   java.lang.Thread.State: RUNNABLE
        at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
        at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:228)
        at java.io.File.exists(File.java:733)
        at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:65)
        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:86)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:228)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.checkDirs(FSDataset.java:414)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:617)
        - locked <0x000000080a8faec0> (a org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.checkDataDir(FSDataset.java:1681)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:745)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:735)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:202)
        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
        at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
        at java.lang.Thread.run(Thread.java:619)

3) This produces timeouts on other calls, e.g.:

   2011-06-17 17:35:03,922 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception:
   java.io.InterruptedIOException
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:260)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:183)
        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
        at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
        at java.lang.Thread.run(Thread.java:619)

4) These failed calls, in turn, produce more "dir check" calls.

5) The whole cluster runs very slowly because of one half-working node.
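Shouldn't checkDiskError be rate-limited? A minimal sketch of what I mean (my own code, not anything that exists in Hadoop; the class name and the interval are made up): no matter how many errors arrive in a burst, at most one full directory scan runs per interval, so the xceiver threads stop queuing up behind the FSVolumeSet lock.

   import java.util.concurrent.atomic.AtomicLong;

   public class ThrottledDiskCheck {
       // Assumed value; tune to taste.
       private static final long MIN_INTERVAL_MS = 60_000L;
       private final AtomicLong lastCheckMs = new AtomicLong(0L);

       /**
        * Run the expensive scan only if enough time has passed since the
        * last one. Concurrent callers race on compareAndSet, so exactly
        * one of them performs the scan and the rest return immediately.
        */
       public void maybeCheckDirs(Runnable expensiveScan) {
           long now = System.currentTimeMillis();
           long last = lastCheckMs.get();
           if (now - last >= MIN_INTERVAL_MS
                   && lastCheckMs.compareAndSet(last, now)) {
               expensiveScan.run(); // e.g. the body of DataNode.checkDiskError
           }
           // Otherwise skip: a scan just ran (or is still running), so
           // starting another would only add contention on the volume lock.
       }
   }

With a guard like this inside DataNode.checkDiskError, a burst of failed writes would trigger one scan instead of one scan per BlockReceiver.close.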
-- 
Best regards,
Vitalii Tymchyshyn