Date: Mon, 8 Feb 2016 21:08:39 +0000 (UTC)
From: "Gary Helmling (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-15234) ReplicationLogCleaner can abort due to transient ZK issues

    [ https://issues.apache.org/jira/browse/HBASE-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137687#comment-15137687 ]

Gary Helmling commented on HBASE-15234:
---------------------------------------

bq. Is it possible that excessive logging is produced for the above approach?

Well, there is already some excess logging due to logging the same stack trace 3 times in the log section shown. We can certainly clean that up. I don't think the log spew will be too excessive, given that the log cleaner only runs every 60 seconds by default, but we can save that discussion for review once I get a patch up. (A rough sketch of the non-aborting approach follows the quoted stack trace below.)

> ReplicationLogCleaner can abort due to transient ZK issues
> ----------------------------------------------------------
>
>                 Key: HBASE-15234
>                 URL: https://issues.apache.org/jira/browse/HBASE-15234
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Replication
>            Reporter: Gary Helmling
>            Assignee: Gary Helmling
>            Priority: Critical
>
> The ReplicationLogCleaner delegate for the LogCleaner chore can abort due to transient errors reading the replication znodes, leaving the log cleaner chore stopped while the master keeps running. This causes logs to build up in the oldWALs directory, which can even hit storage or file-count limits in HDFS, causing problems.
> We've seen this happen in a couple of clusters when a rolling restart was performed on the zk peers (only one restarted at a time).
> The full stack trace when the log cleaner aborts is:
> {noformat}
> 16/02/02 15:22:39 WARN zookeeper.ZKUtil: replicationLogCleaner-0x1522c8b93c2fbae, quorum=XXXXXXXXXXXXXXXXXXXX, baseZNode=/hbase Unable to get data of znode /hbase/replication/rs
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/replication/rs
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>     at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713)
>     at org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:80)
>     at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99)
>     at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70)
>     at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:233)
>     at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:157)
>     at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124)
>     at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:110)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> 16/02/02 15:22:39 ERROR zookeeper.ZooKeeperWatcher: replicationLogCleaner-0x1522c8b93c2fbae, quorum=XXXXXXXXXXXXXXXXXXXX, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/replication/rs
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>     at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713)
>     at org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:80)
>     at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99)
>     at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70)
>     at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:233)
>     at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:157)
>     at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124)
>     at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:110)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> 16/02/02 15:22:39 WARN master.ReplicationLogCleaner: Aborting ReplicationLogCleaner because Failed to get stat of replication rs node
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/replication/rs
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>     at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713)
>     at org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:80)
>     at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99)
>     at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70)
>     at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:233)
>     at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:157)
>     at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124)
>     at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:110)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> 16/02/02 15:22:40 WARN master.ReplicationLogCleaner: Failed to read zookeeper, skipping checking deletable files
> {noformat}
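
For illustration, here is a minimal, self-contained Java sketch of the non-aborting approach discussed in the comment above. This is not the committed patch: ReplicationQueuesReader, TransientZkException, and loadWalsInUse are simplified stand-ins for the real HBase and ZooKeeper classes that appear in the stack trace, so the example compiles on its own.

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;
import java.util.stream.Collectors;

public class NonAbortingLogCleanerSketch {
  private static final Logger LOG =
      Logger.getLogger(NonAbortingLogCleanerSketch.class.getName());

  /** Stand-in for a transient ZooKeeper failure such as ConnectionLossException. */
  static class TransientZkException extends Exception {
    TransientZkException(String msg) { super(msg); }
  }

  /** Stand-in for reading the set of WALs still referenced by replication queues in ZK. */
  interface ReplicationQueuesReader {
    List<String> loadWalsInUse() throws TransientZkException;
  }

  private final ReplicationQueuesReader queues;

  NonAbortingLogCleanerSketch(ReplicationQueuesReader queues) {
    this.queues = queues;
  }

  /**
   * On a transient ZK error, log a single warning and report nothing as
   * deletable, so the chore simply retries on its next scheduled run
   * instead of aborting.
   */
  public List<String> getDeletableFiles(List<String> candidates) {
    final List<String> walsInUse;
    try {
      walsInUse = queues.loadWalsInUse();
    } catch (TransientZkException e) {
      // One concise log line (no triplicated stack traces) and no abort().
      LOG.warning("Failed to read replication queues from ZK, skipping this round: "
          + e.getMessage());
      return Collections.emptyList();
    }
    // A WAL is deletable only if no replication queue still references it.
    return candidates.stream()
        .filter(wal -> !walsInUse.contains(wal))
        .collect(Collectors.toList());
  }
}
{code}

The design choice here is to fail closed: returning the empty list on a ZK hiccup only delays deletion until the next chore run, so the worst case is WALs accumulating for one extra interval, rather than the cleaner staying dead until a master restart.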