Date: Mon, 22 Apr 2013 19:35:28 -0700
Subject: Re: help why do my regionservers shut themselves down?
From: Ted Yu
To: user@hbase.apache.org

Kaveh:
What version of HBase are you using?
Around 2013-04-22 16:47:56, did you observe anything else happening in your cluster? See below:

2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction interrupted by user:
java.io.InterruptedIOException: Aborting compaction of store f in region t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23. because user requested stop.
    at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
    at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
    at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)

On Mon, Apr 22, 2013 at 6:46 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:

> Hi Kaveh,
>
> the answer may already be in the logs you sent ;)
>
> "This disconnect could have been caused by a network partition or a
> long-running GC pause, either way it's recommended that you verify
> your environment."
>
> Do you have GC logs? Have you tried anything to solve that?
>
> JM
>
> 2013/4/22 kaveh minooie:
> >
> > Hi
> >
> > after a few mapreduce jobs my regionservers shut themselves down.
> > this is the latest time that this has happened:
> >
> > 2013-04-22 16:47:21,843 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, trying to reconnect.
> > 2013-04-22 16:47:21,843 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=d1r1n17.prod.plutoz.com,60020,1366657358443, load=(requests=5392, regions=196, usedHeap=1063, maxHeap=3966): regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired from ZooKeeper, aborting
> > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
> >     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
> >     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:523)
> >     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
> > 2013-04-22 16:47:21,843 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Trying to reconnect to zookeeper.
> > 2013-04-22 16:47:21,844 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: requests=1794, regions=196, stores=1561, storefiles=1585, storefileIndexSize=104, memstoreSize=306, compactionQueueSize=10, flushQueueSize=0, usedHeap=1073, maxHeap=3966, blockCacheSize=661986032, blockCacheFree=169901776, blockCacheCount=7242, blockCacheHitCount=910925, blockCacheMissCount=1558134, blockCacheEvictedCount=1344753, blockCacheHitRatio=36, blockCacheHitCachingRatio=40
> > 2013-04-22 16:47:21,844 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired from ZooKeeper, aborting
> > 2013-04-22 16:47:21,844 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> > 2013-04-22 16:47:21,900 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Too many consecutive RollWriter requests, it's a sign of the total number of live datanodes is lower than the tolerable replicas.
> > 2013-04-22 16:47:22,341 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=zk1:2181 sessionTimeout=180000 watcher=hconnection
> > 2013-04-22 16:47:22,357 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 1 regions to close
> > 2013-04-22 16:47:22,394 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server d1r2n2.prod.plutoz.com/10.0.0.66:2181.
> > Will not attempt to authenticate using SASL (unknown error)
> > 2013-04-22 16:47:22,395 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to d1r2n2.prod.plutoz.com/10.0.0.66:2181, initiating session
> > 2013-04-22 16:47:22,397 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server d1r2n2.prod.plutoz.com/10.0.0.66:2181, sessionid = 0x13dd980d2abbf93, negotiated timeout = 40000
> > 2013-04-22 16:47:22,400 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Reconnected successfully. This disconnect could have been caused by a network partition or a long-running GC pause, either way it's recommended that you verify your environment.
> > 2013-04-22 16:47:22,400 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> > 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction interrupted by user:
> > java.io.InterruptedIOException: Aborting compaction of store f in region t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23. because user requested stop.
> >     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
> >     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
> >     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
> >     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
> >     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> > 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion: aborted compaction on region t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
> > after 5mins, 58sec
> > 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver60020.compactor exiting
> > 2013-04-22 16:47:56,832 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
> > 2013-04-22 16:47:57,363 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: regionserver60020.logSyncer exiting
> > 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing leases
> > 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed leases
> > 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
> > 2013-04-22 16:47:57,497 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-15,5,main]
> > 2013-04-22 16:47:57,497 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook
> > 2013-04-22 16:47:57,497 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
> > 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020.leaseChecker closing leases
> > 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020.leaseChecker closed leases
> > 2013-04-22 16:47:57,598 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
> >
> > I would appreciate it very much if someone could explain to me what just
> > happened here.
> >
> > thanks,
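
One detail in the quoted log may be worth a look: the client asks for sessionTimeout=180000, but the server reports negotiated timeout = 40000. The sketch below is only an illustration of how such pairs could be pulled out of a regionserver log. It is not part of HBase or ZooKeeper; the function name `session_timeouts` and the `sample` list are made up for this example, and the two regular expressions assume the exact message wording seen in the log above.

```python
import re

# These patterns assume the wording of the two ZooKeeper client messages
# quoted in this thread; other versions may phrase them differently.
REQUESTED = re.compile(r"sessionTimeout=(\d+)")
NEGOTIATED = re.compile(r"negotiated timeout = (\d+)")


def session_timeouts(lines):
    """Return (requested_ms, negotiated_ms) pairs in the order they appear.

    Hypothetical helper: each "Initiating client connection" line is paired
    with the next "Session establishment complete" line that follows it.
    """
    pairs = []
    requested = None
    for line in lines:
        match = REQUESTED.search(line)
        if match:
            requested = int(match.group(1))
            continue
        match = NEGOTIATED.search(line)
        if match and requested is not None:
            pairs.append((requested, int(match.group(1))))
            requested = None
    return pairs


# Two entries copied from the log quoted above.
sample = [
    "2013-04-22 16:47:22,341 INFO org.apache.zookeeper.ZooKeeper: Initiating "
    "client connection, connectString=zk1:2181 sessionTimeout=180000 "
    "watcher=hconnection",
    "2013-04-22 16:47:22,397 INFO org.apache.zookeeper.ClientCnxn: Session "
    "establishment complete on server d1r2n2.prod.plutoz.com/10.0.0.66:2181, "
    "sessionid = 0x13dd980d2abbf93, negotiated timeout = 40000",
]

for requested, negotiated in session_timeouts(sample):
    note = "  <-- lower than requested" if negotiated < requested else ""
    print(f"requested={requested} ms, negotiated={negotiated} ms{note}")
```

If the negotiated value is consistently lower than the requested one, the ZooKeeper server's own session timeout limits (minSessionTimeout / maxSessionTimeout in its configuration) may be capping it. That is an assumption to verify against your ZooKeeper setup, not a diagnosis.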