From: Ted Yu
Subject: Re: HMaster and HRegionServer going down
Date: Wed, 5 Jun 2013 03:49:04 -0700
To: user@hbase.apache.org

There are a few tips under:
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

Can you check?

Thanks

On Jun 5, 2013, at 2:05 AM, Vimal Jain wrote:

> I don't think so, as I don't find any issues in the data node logs.
> Also, there are a lot of exceptions like "session expired" and "slept more
> than configured time". What are these?
>
> On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu wrote:
>
>> Because your data node 192.168.20.30 broke down, which leads to the RS
>> going down.
>>
>> On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain wrote:
>>
>>> Here are the complete logs:
>>>
>>> http://bin.cakephp.org/saved/103001 - HRegionServer
>>> http://bin.cakephp.org/saved/103000 - HMaster
>>> http://bin.cakephp.org/saved/103002 - Datanode
>>>
>>> On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain wrote:
>>>
>>>> Hi,
>>>> I have set up HBase in pseudo-distributed mode.
>>>> It was working fine for 6 days, but suddenly this morning both HMaster
>>>> and HRegionServer processes went down.
>>>> I checked the logs of both Hadoop and HBase.
>>>> Please help here.
>>>> Here are the snippets:
>>>>
>>>> *Datanode logs:*
>>>> 2013-06-05 05:12:51,436 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>>>> Exception in receiveBlock for block blk_1597245478875608321_2818
>>>> java.io.EOFException: while trying to read 2347 bytes
>>>> 2013-06-05 05:12:51,442 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>>>> writeBlock blk_1597245478875608321_2818 received exception
>>>> java.io.EOFException: while trying to read 2347 bytes
>>>> 2013-06-05 05:12:51,442 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
>>>> DatanodeRegistration(192.168.20.30:50010,
>>>> storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
>>>> infoPort=50075, ipcPort=50020):DataXceiver
>>>> java.io.EOFException: while trying to read 2347 bytes
>>>>
>>>> *HRegion logs:*
>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4694929ms instead of 3000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
>>>> DFSOutputStream ResponseProcessor exception for block
>>>> blk_1597245478875608321_2818 java.net.SocketTimeoutException: 63000 millis
>>>> timeout while waiting for channel to be ready for read.
>>>> ch : java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333
>>>> remote=/192.168.20.30:50010]
>>>> 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 11695345ms instead of 10000000ms, this is likely due to a long
>>>> garbage collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>>> Recovery for block blk_1597245478875608321_2818 bad datanode[0]
>>>> 192.168.20.30:50010
>>>> 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>>> while syncing
>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
>>>> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
>>>> 2013-06-05 05:12:51,110 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog:
>>>> Could not sync. Requesting close of hlog
>>>> java.io.IOException: Reflection
>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>> 2013-06-05 05:12:51,180 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog:
>>>> Could not sync. Requesting close of hlog
>>>> java.io.IOException: Reflection
>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>> 2013-06-05 05:12:51,183 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog:
>>>> Failed close of HLog writer
>>>> java.io.IOException: Reflection
>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>> 2013-06-05 05:12:51,184 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
>>>> Riding over HLog close failure!
>>>> error count=1
>>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> ABORTING region server hbase.rummycircle.com,60020,1369877672964:
>>>> regionserver:60020-0x13ef31264d00001 regionserver:60020-0x13ef31264d00001
>>>> received expired from ZooKeeper, aborting
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired
>>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> RegionServer abort: loaded coprocessors are: []
>>>> 2013-06-05 05:12:52,621 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker:
>>>> SplitLogWorker interrupted while waiting for task, exiting:
>>>> java.lang.InterruptedException
>>>> java.io.InterruptedIOException: Aborting compaction of store cfp_info in
>>>> region event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
>>>> because user requested stop.
>>>> 2013-06-05 05:12:53,425 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>>>> Possibly transient ZooKeeper exception:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for
>>>> /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>> 2013-06-05 05:12:55,426 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>>>> Possibly transient ZooKeeper exception:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for
>>>> /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>> 2013-06-05 05:12:59,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>>>> Possibly transient ZooKeeper exception:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for
>>>> /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>> 2013-06-05 05:13:07,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>>>> Possibly transient ZooKeeper exception:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for
>>>> /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>> 2013-06-05 05:13:07,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>>>> ZooKeeper delete failed after 3 retries
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for
>>>> /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>>>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>> 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient: Exception
>>>> closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/
>>>> hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
>>>> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
>>>>
>>>> *HMaster logs:*
>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4702394ms instead of 10000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4988731ms instead of 300000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4988726ms instead of 300000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4698291ms instead of 10000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4694502ms instead of 1000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4694492ms instead of 1000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4695589ms instead of 60000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster:
>>>> Master server abort: loaded coprocessors are: []
>>>> 2013-06-05 05:12:52,465 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>> Waiting for region servers count to settle; currently checked in 1, slept
>>>> for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
>>>> ms, interval of 1500 ms.
>>>> 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster:
>>>> Region server hbase.rummycircle.com,60020,1369877672964 reported a fatal
>>>> error:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired
>>>> 2013-06-05 05:12:53,970 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>> Waiting for region servers count to settle; currently checked in 1, slept
>>>> for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
>>>> ms, interval of 1500 ms.
>>>> 2013-06-05 05:12:55,476 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>> Waiting for region servers count to settle; currently checked in 1, slept
>>>> for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
>>>> ms, interval of 1500 ms.
>>>> 2013-06-05 05:12:56,981 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>> Finished waiting for region servers count to settle; checked in 1, slept
>>>> for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is
>>>> running.
>>>> 2013-06-05 05:12:57,019 INFO org.apache.hadoop.hbase.catalog.CatalogTracker:
>>>> Failed verification of -ROOT-,,0 at
>>>> address=hbase.rummycircle.com,60020,1369877672964; java.io.EOFException
>>>> 2013-06-05 05:17:52,302 WARN org.apache.hadoop.hbase.master.SplitLogManager:
>>>> error while splitting logs in
>>>> [hdfs://192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting]
>>>> installed = 19 but only 0 done
>>>> 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster:
>>>> master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received
>>>> expired from ZooKeeper, aborting
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired
>>>> java.io.IOException: Giving up after tries=1
>>>> Caused by: java.lang.InterruptedException: sleep interrupted
>>>> 2013-06-05 05:17:52,381 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine:
>>>> Failed to start master
>>>> java.lang.RuntimeException: HMaster Aborted
>>>>
>>>> --
>>>> Thanks and Regards,
>>>> Vimal Jain
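
The Sleeper warnings in the logs above come down to one simple check: a periodic thread measures how long its sleep actually took versus how long it was configured to sleep, and a large overshoot is taken as evidence of a long JVM/GC pause. Here is a minimal sketch of that idea (this is NOT HBase's actual Sleeper code, and the 4x warning threshold is an assumption for illustration, not HBase's real cutoff):

```python
def pause_warning(expected_ms, actual_ms, factor=4):
    # If the measured sleep wildly overshoots the configured period,
    # report a likely GC/VM pause in the style of the log lines above.
    # `factor` is an illustrative threshold, not HBase's actual value.
    if actual_ms > expected_ms * factor:
        return ("We slept %dms instead of %dms, this is likely due to a "
                "long garbage collecting pause" % (actual_ms, expected_ms))
    return None

# Values taken from the HRegion log above: roughly a 78-minute stall
# on a 3-second period, which clearly trips the check.
print(pause_warning(3000, 4694929))
# Ordinary scheduling jitter stays quiet:
print(pause_warning(3000, 3100))
```

A pause of this size (over an hour) will outlive any reasonable ZooKeeper session timeout, so the session expiry and region-server abort follow directly from it; the book section linked above focuses on reducing such pauses (heap sizing and GC tuning) rather than only raising `zookeeper.session.timeout`.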