From: Todd Lipcon
Date: Fri, 16 Apr 2010 17:34:44 -0700
Subject: Re: Extremely slow HDFS after upgrade
To: common-user@hadoop.apache.org

Hey Scott,

This is indeed really strange... if you do a straight "hadoop fs -put"
with dfs.replication set to 1 from one of the DNs, is the upload still
slow? That would cut the network out of the equation.

-Todd

On Fri, Apr 16, 2010 at 5:29 PM, Scott Carey wrote:

> I have two clusters upgraded to CDH2. One is performing fine, and the
> other is EXTREMELY slow.
>
> Some jobs that formerly took 90 seconds now take 20 to 50 minutes.
>
> It is an HDFS issue from what I can tell.
>
> The simple DFS benchmark with one map task shows the problem clearly.
> I have looked at every difference I can find and am wondering where
> else to look to track this down.
>
> The disks on all nodes in the cluster check out -- capable of 75MB/sec
> minimum with a 'dd' write test (sketched below).
> top / iostat do not show any significant CPU usage or iowait times on
> any machine in the cluster during the test.
> ifconfig does not report any dropped packets or other errors on any
> machine in the cluster. dmesg has nothing interesting.
>
> The poorly performing cluster is on a slightly newer CentOS version:
> Poor: 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64
> x86_64 x86_64 GNU/Linux (CentOS 5.4, recent patches)
> Good: 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64
> x86_64 GNU/Linux (CentOS 5.3, I think)
>
> The performance is always poor, not sporadically poor. It is poor with
> M/R tasks as well as non-M/R HDFS clients (e.g. sqoop).
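> For reference, the dd check on each node was a plain sequential write
> along these lines (the block size, count, and target path here are
> illustrative, not the exact invocation):
>
> $ dd if=/dev/zero of=/data1/ddtest bs=1M count=2000 conv=fsync
>
> Every node sustained at least 75MB/sec on that.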
>
> Poor performance cluster (no other jobs active during the test):
> ---------------------------------------------------
> $ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write -nrFiles 1 -fileSize 2000
> 10/04/16 12:53:13 INFO mapred.FileInputFormat: nrFiles = 1
> 10/04/16 12:53:13 INFO mapred.FileInputFormat: fileSize (MB) = 2000
> 10/04/16 12:53:13 INFO mapred.FileInputFormat: bufferSize = 1000000
> 10/04/16 12:53:14 INFO mapred.FileInputFormat: creating control file: 2000 mega bytes, 1 files
> 10/04/16 12:53:14 INFO mapred.FileInputFormat: created control files for: 1 files
> 10/04/16 12:53:14 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 10/04/16 12:53:15 INFO mapred.FileInputFormat: Total input paths to process : 1
> 10/04/16 12:53:15 INFO mapred.JobClient: Running job: job_201004091928_0391
> 10/04/16 12:53:16 INFO mapred.JobClient:  map 0% reduce 0%
> 10/04/16 13:42:30 INFO mapred.JobClient:  map 100% reduce 0%
> 10/04/16 13:43:06 INFO mapred.JobClient:  map 100% reduce 100%
> 10/04/16 13:43:07 INFO mapred.JobClient: Job complete: job_201004091928_0391
> [snip]
> 10/04/16 13:43:07 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
> 10/04/16 13:43:07 INFO mapred.FileInputFormat:            Date & time: Fri Apr 16 13:43:07 PDT 2010
> 10/04/16 13:43:07 INFO mapred.FileInputFormat:        Number of files: 1
> 10/04/16 13:43:07 INFO mapred.FileInputFormat: Total MBytes processed: 2000
> 10/04/16 13:43:07 INFO mapred.FileInputFormat:      Throughput mb/sec: 0.678296742615553
> 10/04/16 13:43:07 INFO mapred.FileInputFormat: Average IO rate mb/sec: 0.6782967448234558
> 10/04/16 13:43:07 INFO mapred.FileInputFormat:  IO rate std deviation: 9.568803140552889E-5
> 10/04/16 13:43:07 INFO mapred.FileInputFormat:     Test exec time sec: 2992.913
> ------------------------------------
>
> Good performance cluster (other jobs active during the test):
> ---------------------------------
> $ hadoop jar /usr/lib/hadoop/hadoop-0.20.1+169.68-test.jar TestDFSIO -write -nrFiles 1 -fileSize 2000
> 10/04/16 12:50:52 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
> TestFDSIO.0.0.4
> 10/04/16 12:50:52 INFO mapred.FileInputFormat: nrFiles = 1
> 10/04/16 12:50:52 INFO mapred.FileInputFormat: fileSize (MB) = 2000
> 10/04/16 12:50:52 INFO mapred.FileInputFormat: bufferSize = 1000000
> 10/04/16 12:50:52 INFO mapred.FileInputFormat: creating control file: 2000 mega bytes, 1 files
> 10/04/16 12:50:52 INFO mapred.FileInputFormat: created control files for: 1 files
> 10/04/16 12:50:52 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 10/04/16 12:50:53 INFO mapred.FileInputFormat: Total input paths to process : 1
> 10/04/16 12:50:54 INFO mapred.JobClient: Running job: job_201003311607_4098
> 10/04/16 12:50:55 INFO mapred.JobClient:  map 0% reduce 0%
> 10/04/16 12:51:22 INFO mapred.JobClient:  map 100% reduce 0%
> 10/04/16 12:51:32 INFO mapred.JobClient:  map 100% reduce 100%
> 10/04/16 12:51:32 INFO mapred.JobClient: Job complete: job_201003311607_4098
> [snip]
> 10/04/16 12:51:32 INFO mapred.FileInputFormat: ----- TestDFSIO ----- : write
> 10/04/16 12:51:32 INFO mapred.FileInputFormat:            Date & time: Fri Apr 16 12:51:32 PDT 2010
> 10/04/16 12:51:32 INFO mapred.FileInputFormat:        Number of files: 1
> 10/04/16 12:51:32 INFO mapred.FileInputFormat: Total MBytes processed: 2000
> 10/04/16 12:51:32 INFO mapred.FileInputFormat:      Throughput mb/sec: 92.47699634715865
> 10/04/16 12:51:32 INFO mapred.FileInputFormat: Average IO rate mb/sec: 92.47699737548828
> 10/04/16 12:51:32 INFO mapred.FileInputFormat:  IO rate std deviation: 0.008579127784432736
> 10/04/16 12:51:32 INFO mapred.FileInputFormat:     Test exec time sec: 39.522
> -------------------------------
>
> In the namenode log during the write, there are roughly 2-minute gaps
> between 64MB block writes:
>
> 2010-04-16 12:56:41,081 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /benchmarks/TestDFSIO/io_data/test_io_0. blk_5304238373340657846_291044
> 2010-04-16 12:58:15,144 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.142:50010 is added to blk_5304238373340657846_291044 size 67108864
> 2010-04-16 12:58:15,145 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.140:50010 is added to blk_5304238373340657846_291044 size 67108864
> 2010-04-16 12:58:15,147 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.146:50010 is added to blk_5304238373340657846_291044 size 67108864
> 2010-04-16 12:58:15,148 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /benchmarks/TestDFSIO/io_data/test_io_0. blk_-4700875175596279045_291044
> 2010-04-16 12:58:15,792 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.147:50010 is added to blk_-4700875175596279045_291044 size 67108864
> 2010-04-16 12:58:15,792 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.149:50010 is added to blk_-4700875175596279045_291044 size 67108864
> 2010-04-16 12:58:15,794 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.3.0.146:50010 is added to blk_-4700875175596279045_291044 size 67108864
> 2010-04-16 12:58:15,795 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /benchmarks/TestDFSIO/io_data/test_io_0. blk_1059681682487510599_291044
>
> On the node where the map task ran, the datanode log shows similar
> gaps while it receives each block:
>
> 2010-04-16 12:55:08,303 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-7985493313278868469_291044 src: /10.3.0.146:48168 dest: /10.3.0.146:50010
> 2010-04-16 12:56:42,281 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.3.0.146:48168, dest: /10.3.0.146:50010, bytes: 67108864, op: HDFS_WRITE, cliID: DFSClient_attempt_201004091928_0391_m_000000_0, offset: 0, srvID: DS-350876050-10.3.0.146-50010-1265338485169, blockid: blk_-7985493313278868469_291044, duration: 93972113000
> 2010-04-16 12:56:42,282 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_-7985493313278868469_291044 terminating
> 2010-04-16 12:56:42,285 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_5304238373340657846_291044 src: /10.3.0.146:48175 dest: /10.3.0.146:50010
> 2010-04-16 12:58:16,350 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.3.0.146:48175, dest: /10.3.0.146:50010, bytes: 67108864, op: HDFS_WRITE, cliID: DFSClient_attempt_201004091928_0391_m_000000_0, offset: 0, srvID: DS-350876050-10.3.0.146-50010-1265338485169, blockid: blk_5304238373340657846_291044, duration: 94059819000
> 2010-04-16 12:58:16,350 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_5304238373340657846_291044 terminating
> 2010-04-16 12:58:16,353 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-4700875175596279045_291044 src: /10.3.0.146:48182 dest: /10.3.0.146:50010
> 2010-04-16 12:58:16,997 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.3.0.146:48182, dest: /10.3.0.146:50010, bytes: 67108864, op: HDFS_WRITE, cliID: DFSClient_attempt_201004091928_0391_m_000000_0, offset: 0, srvID: DS-350876050-10.3.0.146-50010-1265338485169, blockid: blk_-4700875175596279045_291044, duration: 637761000
> 2010-04-16 12:58:16,997 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_-4700875175596279045_291044 terminating
>
> On a different datanode it looks essentially the same, just with fewer
> blocks, since every block is written to the local node but a given
> remote node receives only (replication - 1)/(cluster size - 1) of the
> blocks.
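>
> If I'm reading the clienttrace durations right (they appear to be in
> nanoseconds), the numbers are consistent: 93972113000 ns is ~94
> seconds for a 67108864-byte block, i.e. 64MB / 94s = ~0.68 MB/sec,
> which matches the 0.678 mb/sec throughput TestDFSIO reports above
> almost exactly. Interestingly, the third block's duration of 637761000
> ns is only ~0.64 seconds, so the occasional block does go through at
> full speed.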
>
> I grabbed a few stack traces of the running map task, but I don't
> think there are any surprises there -- it's waiting on an HDFS write:
>
> "DataStreamer for file /benchmarks/TestDFSIO/io_data/test_io_0 block blk_6860895546192904948_291044" daemon prio=10 tid=0x00002aab08244000 nid=0x1031 runnable [0x0000000040c7a000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>         - locked <0x00002aaab363dc88> (a sun.nio.ch.Util$1)
>         - locked <0x00002aaab363dc78> (a java.util.Collections$UnmodifiableSet)
>         - locked <0x00002aaab363da50> (a sun.nio.ch.EPollSelectorImpl)
>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>         - locked <0x00002aaaf89023c0> (a java.io.BufferedOutputStream)
>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>
> "main" prio=10 tid=0x0000000055fb6000 nid=0x1017 waiting for monitor entry [0x000000004112c000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:3006)
>         - waiting to lock <0x00002aaac36bb440> (a java.util.LinkedList)
>         - locked <0x00002aaac36bae60> (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
>         at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
>
> I can get some stack traces from the namenode and datanodes during
> this too. Any other suggestions? Is there anything in a 0.19.2 to
> 0.20.1+ upgrade that might end up improperly configured and cause this
> sort of symptom? It's almost like HDFS just decides to sleep for a
> second between each 64K network transfer. Non-Hadoop network transfer
> (e.g. scp) seems fine.
>
> Another bit of information: it's not just HDFS writes -- the read test
> is almost equally awful (1.9MB/sec).
>
> The map task running the read test is in this stack trace:
>
> "main" prio=10 tid=0x00000000554b3000 nid=0x3376 runnable [0x0000000040a1d000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>         - locked <0x00002aaaf7641740> (a sun.nio.ch.Util$1)
>         - locked <0x00002aaaf7641730> (a java.util.Collections$UnmodifiableSet)
>         - locked <0x00002aaaf7641508> (a sun.nio.ch.EPollSelectorImpl)
>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>         - locked <0x00002aaaf76400f8> (a java.io.BufferedInputStream)
>         at java.io.DataInputStream.read(DataInputStream.java:132)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>         at org.apache.hadoop.hdfs.DFSClient$BlockReader.readChunk(DFSClient.java:1283)
>
> The namenode is not doing anything, as far as I can tell from its
> stack traces. The datanode being read from at any given time seems to
> be in this stack trace:
>
> "org.apache.hadoop.hdfs.server.datanode.DataXceiver@7b90a2d" daemon prio=10 tid=0x00002aaaf43b8000 nid=0x79cc runnable [0x000000004433c000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>         - locked <0x00002aaae7ee8440> (a sun.nio.ch.Util$1)
>         - locked <0x00002aaae7ee8428> (a java.util.Collections$UnmodifiableSet)
>         - locked <0x00002aaae7ee8098> (a sun.nio.ch.EPollSelectorImpl)
>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:401)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
>         at java.lang.Thread.run(Thread.java:619)
>
> So the behavior is as if the network is extremely slow, but it seems
> to affect only Hadoop. Any ideas?
>
> Thanks,
>
> -Scott

--
Todd Lipcon
Software Engineer, Cloudera
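P.S. The straight put I have in mind would be run from a shell on one
of the DNs, roughly like this (generic-option syntax from memory; the
file and destination path are just illustrative):

$ hadoop fs -D dfs.replication=1 -put /tmp/2GB.file /benchmarks/puttest

With replication 1 the block should land on the local datanode, so if
that's also ~0.7 MB/sec, the network is off the hook and it's the local
write path.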