From: Bradford Stephens
Date: Sun, 13 Feb 2011 23:40:12 -0800
Subject: Re: "Error recovery for block... failed because recovery from primary datanode failed 6 times"
To: user@hbase.apache.org
In-Reply-To: <4A2E351F553A2C4D89148CC937FA8182041EA9@SC-MBX02-1.TheFacebook.com>

We've got dfs.replication = 3 in hdfs-site.xml.

Doing a grep for "FATAL" and the surrounding 50 lines yields this:

Regionserver log: http://pastebin.com/3cYYNhct

HMaster and DataNode logs seem pretty boring, no errors. Some sections of
lots of scheduling/deleting blocks...

Restarted the HBase nodes, ran the MR job again (it's just reading CSV
into a table). Seems to be running just fine.

On Sun, Feb 13, 2011 at 11:08 PM, Jonathan Gray wrote:
> The DFS errors are after the server aborts. What is in the log before the server abort? Doesn't seem to show any reason here which is unusual.
>
> Anything in the master? Did it time out this RS? You're running with replication = 1?
>
>> -----Original Message-----
>> From: Bradford Stephens [mailto:bradfordstephens@gmail.com]
>> Sent: Sunday, February 13, 2011 10:59 PM
>> To: user@hbase.apache.org
>> Subject: "Error recovery for block... failed because recovery from primary
>> datanode failed 6 times"
>>
>> Hey guys,
>>
>> I'm occasionally getting regionservers going down (running a late RC of .89
>> that Ryan built).
>> 5x c2.xlarge nodes (8gb/6 cores?) on EC2 with EBS drives.
>>
>> Here's the error message from the RS log. Hadoop fsck shows it's fine.
>>
>> Any ideas?
>>
>>
>> 2011-02-14 01:51:51,715 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed mobile4-2011021,20110122:37b16319-58e8-4809-bca6-83d7598a41dd:E84F9612-CE1A-4FE1-AAE9-2A7AF8C9B2F1:21519,1297657239532.d15ce98030138cad79e248e0845b70ee.
>> 2011-02-14 01:51:51,715 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: ip-10-243-106-63.ec2.internal,60020,1297656774012
>> 2011-02-14 01:51:51,711 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker: regionserver60020.majorCompactionChecker exiting
>> 2011-02-14 01:51:51,856 INFO org.apache.zookeeper.ZooKeeper: Session: 0x12e225ef5640002 closed
>> 2011-02-14 01:51:51,856 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: > 63.ec2.internal,60020,1297656773719>Closed connection with ZooKeeper; /hbase/root-region-server
>> 2011-02-14 01:51:58,706 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: worker thread exiting
>> 2011-02-14 01:51:58,706 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
>> 2011-02-14 01:52:00,031 INFO org.apache.hadoop.hbase.Leases: regionserver60020.leaseChecker closing leases
>> 2011-02-14 01:52:00,031 INFO org.apache.hadoop.hbase.Leases: regionserver60020.leaseChecker closed leases
>> 2011-02-14 01:52:00,033 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-10,5,main]
>> 2011-02-14 01:52:00,033 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
>> 2011-02-14 01:52:00,036 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase-entest/.logs/ip-10-243-106-63.ec2.internal,60020,1297656774012/10.243.106.63%3A60020.1297660376363 : java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_208685344091455182_10263 failed  because recovery from primary datanode 10.243.106.63:50010 failed 6 times.  Pipeline was 10.243.106.63:50010. Aborting...
>> java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_208685344091455182_10263 failed  because recovery from primary datanode 10.243.106.63:50010 failed 6 times.  Pipeline was 10.243.106.63:50010. Aborting...
>>       at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3214)
>>       at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
>>       at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
>>       at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>       at java.lang.reflect.Method.invoke(Method.java:597)
>>       at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:123)
>>       at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:906)
>>       at org.apache.hadoop.hbase.regionserver.wal.HLog.completeCacheFlush(HLog.java:1078)
>>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:943)
>>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:834)
>>       at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:786)
>>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:250)
>>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:224)
>>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:146)
>> 2011-02-14 01:52:00,076 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
>> 2011-02-14 01:52:00,139 WARN org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: No longer connected to ZooKeeper, current state: Disconnected
>>
>>
>> --
>> Bradford Stephens,
>> Founder, Drawn to Scale
>> drawntoscalehq.com
>> 727.697.7528
>>
>> http://www.drawntoscalehq.com -- The intuitive, cloud-scale data solution.
>> Process, store, query, search, and serve all your data.
>>
>> http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and
>> Computer Science
>

--
Bradford Stephens,
Founder, Drawn to Scale
drawntoscalehq.com
727.697.7528

http://www.drawntoscalehq.com -- The intuitive, cloud-scale data solution.
Process, store, query, search, and serve all your data.

http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and
Computer Science
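
For anyone repeating the log search mentioned in this message (a grep for "FATAL" plus the surrounding 50 lines), here is a minimal sketch in Python of the same idea. It is roughly what `grep -C 50 FATAL <logfile>` does, not the exact command used in the thread. The function names and the log path in the usage note are hypothetical, chosen only for illustration.

```python
import re
from typing import List, Tuple


def grep_with_context(lines: List[str], pattern: str, context: int = 50) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges, end exclusive, covering every line
    that matches `pattern` plus `context` lines before and after it.
    Overlapping or touching ranges are merged so no line is reported twice."""
    regex = re.compile(pattern)
    ranges: List[Tuple[int, int]] = []
    for i, line in enumerate(lines):
        if not regex.search(line):
            continue
        start = max(0, i - context)
        end = min(len(lines), i + context + 1)
        if ranges and start <= ranges[-1][1]:
            # This window overlaps or touches the previous one: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges


def print_matches(path: str, pattern: str = "FATAL", context: int = 50) -> None:
    """Print each merged window from the file at `path`, separated by '--'."""
    with open(path, errors="replace") as handle:
        lines = handle.read().splitlines()
    for start, end in grep_with_context(lines, pattern, context):
        print("\n".join(lines[start:end]))
        print("--")
```

Usage would look like `print_matches("regionserver.log")`, where the file name is a placeholder for wherever the regionserver log lives on a given node. Merging the windows keeps the output readable when several FATAL lines occur close together, which is common when a server aborts.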