Date: Wed, 1 Feb 2017 23:05:51 +0000 (UTC)
From: "Gary Helmling (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

[ https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15849077#comment-15849077 ]

Gary Helmling commented on HBASE-17381:
---------------------------------------

[~openinx] on the current patches:

* Since we are only aborting/stopping the regionserver, we can continue to use an UncaughtExceptionHandler for this purpose. We already create and attach an UncaughtExceptionHandler in ReplicationSourceWorkerThread.startup(), so that seems like the right place to fix it.
* In the case of an OOME (as checked for in your initial patch), it seems fine to use Runtime.halt(). However, that is pretty extreme in any other case.
* For other uncaught exceptions, it would be better to use Stoppable.stop(String reason). A Stoppable instance (the regionserver) is passed through to ReplicationSourceManager. We can use this instance to create a UEH that calls Stoppable.stop() when the exception we encounter is not an OOME. This will give regions a chance to close cleanly, etc., and will speed recovery.
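A minimal sketch of the handler described above. Note this is illustrative, not the actual patch: the Stoppable interface here is a stripped-down stand-in for org.apache.hadoop.hbase.Stoppable, and the class and method names are hypothetical.

```java
// Sketch of an UncaughtExceptionHandler for ReplicationSourceWorkerThread:
// halt the JVM on OOME, otherwise ask the regionserver to stop cleanly.
public class ReplicationUehSketch {

    /** Minimal stand-in for org.apache.hadoop.hbase.Stoppable. */
    interface Stoppable {
        void stop(String why);
        boolean isStopped();
    }

    /**
     * Builds a handler that halts the JVM on OutOfMemoryError (JVM state is
     * suspect, so bail out hard) and otherwise calls Stoppable.stop() so
     * regions get a chance to close cleanly and recovery is faster.
     */
    static Thread.UncaughtExceptionHandler newHandler(final Stoppable server) {
        return new Thread.UncaughtExceptionHandler() {
            @Override
            public void uncaughtException(Thread t, Throwable e) {
                if (e instanceof OutOfMemoryError) {
                    // Extreme, but appropriate when memory is exhausted.
                    Runtime.getRuntime().halt(1);
                } else {
                    server.stop("Uncaught exception in " + t.getName() + ": " + e);
                }
            }
        };
    }

    public static void main(String[] args) {
        // Demonstrate the non-OOME path with a toy Stoppable.
        final boolean[] stopped = {false};
        Stoppable server = new Stoppable() {
            public void stop(String why) { stopped[0] = true; }
            public boolean isStopped() { return stopped[0]; }
        };
        Thread.UncaughtExceptionHandler h = newHandler(server);
        // Simulate an uncaught non-OOME exception from a worker thread.
        h.uncaughtException(Thread.currentThread(), new RuntimeException("boom"));
        System.out.println("stopped=" + server.isStopped()); // prints "stopped=true"
    }
}
```

In the real code the regionserver passed to ReplicationSourceManager would play the role of `server`, and the handler would be attached in ReplicationSourceWorkerThread.startup() via Thread.setUncaughtExceptionHandler().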
> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -----------------------------------------------------------------
>
>                 Key: HBASE-17381
>                 URL: https://issues.apache.org/jira/browse/HBASE-17381
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Gary Helmling
>            Assignee: huzheng
>         Attachments: HBASE-17381.patch, HBASE-17381.v1.patch, HBASE-17381.v2.patch
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the run() method (for example, failure to allocate direct memory for the DFS client), the exception will be logged by the UncaughtExceptionHandler, but the thread will also die and the replication queue will back up indefinitely until the regionserver is restarted.
>
> We should make sure the worker thread is resilient to all exceptions that it can actually handle. For those that it really can't, it seems better to abort the regionserver rather than just allow replication to stop with minimal signal.
>
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in ReplicationSourceWorkerThread, currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
>         at java.nio.Bits.reserveMemory(Bits.java:693)
>         at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>         at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
>         at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:96)
>         at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:113)
>         at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:108)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>         at org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
>         at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
>         at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
>         at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
>         at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
>         at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
>         at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
>         at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
>         at java.io.DataInputStream.read(DataInputStream.java:100)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)