Mailing-List: user@hbase.apache.org
Date: Wed, 5 Jun 2013 16:24:22 +0530
Subject: Re: HMaster and HRegionServer going down
From: Vimal Jain
To: user@hbase.apache.org

Yes, I did check those. But I am not sure whether those parameter settings
are the issue, as there are some other exceptions in the logs
("DFSOutputStream ResponseProcessor exception", etc.).

On Wed, Jun 5, 2013 at 4:19 PM, Ted Yu wrote:

> There are a few tips under:
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>
> Can you check?
>
> Thanks
>
> On Jun 5, 2013, at 2:05 AM, Vimal Jain wrote:
>
> > I don't think so, as I don't find any issues in the datanode logs.
> > Also, there are a lot of exceptions like "session expired" and
> > "slept more than configured time". What are these?
> >
> > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu wrote:
> >
> >> Because your data node 192.168.20.30 broke down, which led to the
> >> RS going down.
> >>
> >> On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain wrote:
> >>
> >>> Here are the complete logs:
> >>>
> >>> http://bin.cakephp.org/saved/103001 - HRegion
> >>> http://bin.cakephp.org/saved/103000 - HMaster
> >>> http://bin.cakephp.org/saved/103002 - Datanode
> >>>
> >>> On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain wrote:
> >>>
> >>>> Hi,
> >>>> I have set up HBase in pseudo-distributed mode.
> >>>> It was working fine for 6 days, but suddenly this morning both
> >>>> HMaster and HRegionServer processes went down.
> >>>> I checked the logs of both Hadoop and HBase.
> >>>> Please help here.
> >>>> Here are the snippets:
> >>>>
> >>>> *Datanode logs:*
> >>>> 2013-06-05 05:12:51,436 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_1597245478875608321_2818 java.io.EOFException: while trying to read 2347 bytes
> >>>> 2013-06-05 05:12:51,442 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_1597245478875608321_2818 received exception java.io.EOFException: while trying to read 2347 bytes
> >>>> 2013-06-05 05:12:51,442 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.20.30:50010, storageID=DS-1816106352-192.168.20.30-50010-1369314076237, infoPort=50075, ipcPort=50020):DataXceiver
> >>>> java.io.EOFException: while trying to read 2347 bytes
> >>>>
> >>>> *HRegion logs:*
> >>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694929ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_1597245478875608321_2818 java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333 remote=/192.168.20.30:50010]
> >>>> 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 11695345ms instead of 10000000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1597245478875608321_2818 bad datanode[0] 192.168.20.30:50010
> >>>> 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
> >>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >>>> 2013-06-05 05:12:51,110 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
> >>>> java.io.IOException: Reflection
> >>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>> 2013-06-05 05:12:51,180 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
> >>>> java.io.IOException: Reflection
> >>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>> 2013-06-05 05:12:51,183 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog writer
> >>>> java.io.IOException: Reflection
> >>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>> 2013-06-05 05:12:51,184 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close failure! error count=1
> >>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server hbase.rummycircle.com,60020,1369877672964: regionserver:60020-0x13ef31264d00001 regionserver:60020-0x13ef31264d00001 received expired from ZooKeeper, aborting
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
> >>>> 2013-06-05 05:12:52,621 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker interrupted while waiting for task, exiting: java.lang.InterruptedException
> >>>> java.io.InterruptedIOException: Aborting compaction of store cfp_info in region event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e. because user requested stop.
> >>>> 2013-06-05 05:12:53,425 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>> 2013-06-05 05:12:55,426 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>> 2013-06-05 05:12:59,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>> 2013-06-05 05:13:07,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>> 2013-06-05 05:13:07,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 3 retries
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> >>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >>>> 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 : java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
> >>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >>>>
> >>>> *HMaster logs:*
> >>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4702394ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4988731ms instead of 300000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4988726ms instead of 300000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4698291ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694502ms instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694492ms instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4695589ms instead of 60000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
> >>>> 2013-06-05 05:12:52,465 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
> >>>> 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster: Region server hbase.rummycircle.com,60020,1369877672964 reported a fatal error:
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >>>> 2013-06-05 05:12:53,970 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
> >>>> 2013-06-05 05:12:55,476 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
> >>>> 2013-06-05 05:12:56,981 INFO org.apache.hadoop.hbase.master.ServerManager: Finished waiting for region servers count to settle; checked in 1, slept for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is running.
> >>>> 2013-06-05 05:12:57,019 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964; java.io.EOFException
> >>>> 2013-06-05 05:17:52,302 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting] installed = 19 but only 0 done
> >>>> 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster: master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received expired from ZooKeeper, aborting
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >>>> java.io.IOException: Giving up after tries=1
> >>>> Caused by: java.lang.InterruptedException: sleep interrupted
> >>>> 2013-06-05 05:17:52,381 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> >>>> java.lang.RuntimeException: HMaster Aborted
> >>>>
> >>>> --
> >>>> Thanks and Regards,
> >>>> Vimal Jain
> >>>
> >>> --
> >>> Thanks and Regards,
> >>> Vimal Jain
> >
> > --
> > Thanks and Regards,
> > Vimal Jain

--
Thanks and Regards,
Vimal Jain
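The Sleeper warnings quoted above ("We slept 4694929ms instead of 3000ms") are the key clue in this thread: when a stop-the-world GC pause (or heavy swapping) stalls the JVM for longer than the ZooKeeper session timeout, ZooKeeper expires the server's session and the RegionServer aborts itself by design. A minimal sketch of that arithmetic, assuming a hypothetical zookeeper.session.timeout of 180000 ms (the actual default depends on the HBase version; this is not HBase code):

```java
// Sketch only: a ZooKeeper session expires when the client's heartbeats
// stop for longer than the negotiated session timeout, which is exactly
// what a long stop-the-world GC pause causes.
public class SessionExpiryCheck {

    // True when a pause of pauseMs would outlive the ZooKeeper session.
    static boolean wouldExpire(long pauseMs, long sessionTimeoutMs) {
        return pauseMs > sessionTimeoutMs;
    }

    public static void main(String[] args) {
        long pauseMs = 4_694_929L;        // from the Sleeper warning in the logs (~78 minutes)
        long sessionTimeoutMs = 180_000L; // assumed zookeeper.session.timeout
        System.out.println(wouldExpire(pauseMs, sessionTimeoutMs)); // prints "true"
    }
}
```

With a pause of roughly 78 minutes against a session timeout measured in minutes, the expiry (and the resulting RegionServer and HMaster aborts) is the expected outcome; the useful next step is diagnosing why the JVM paused that long (GC logs, swap activity), not raising the timeout.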