Date: Wed, 8 Feb 2017 07:19:42 +0000 (UTC)
From: "Hudson (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

    [ https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857567#comment-15857567 ]

Hudson commented on HBASE-17381:
--------------------------------

FAILURE: Integrated in Jenkins build HBase-1.4 #618 (See [https://builds.apache.org/job/HBase-1.4/618/])
HBASE-17381 ReplicationSourceWorkerThread can die due to unhandled (garyh: rev 8574934f5912b09b785444036bfee9740c966bbb)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -----------------------------------------------------------------
>
>                 Key: HBASE-17381
>                 URL: https://issues.apache.org/jira/browse/HBASE-17381
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Gary Helmling
>            Assignee: huzheng
>             Fix For: 2.0.0, 1.4.0, 1.3.1, 1.2.5
>
>         Attachments: HBASE-17381.patch, HBASE-17381.v1.patch, HBASE-17381.v2.patch, HBASE-17381.v3.patch
>
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the run() method (for example failure to allocate direct memory for the DFS client), the exception will be logged by the UncaughtExceptionHandler, but the thread will also die and the replication queue will back up indefinitely until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it can actually handle. For those that it really can't, it seems better to abort the regionserver rather than just allow replication to stop with minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in ReplicationSourceWorkerThread, currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
>         at java.nio.Bits.reserveMemory(Bits.java:693)
>         at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>         at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
>         at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:96)
>         at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:113)
>         at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:108)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>         at org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
>         at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
>         at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
>         at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
>         at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
>         at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
>         at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
>         at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
>         at java.io.DataInputStream.read(DataInputStream.java:100)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
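
The issue description above comes down to two behaviors for the worker thread: retry on failures it can reasonably recover from, and abort the regionserver on anything it cannot, such as the OutOfMemoryError in the stack trace (an Error, so it escapes a plain catch (Exception) and otherwise only reaches the uncaught exception handler, which just logs it while the thread dies). Below is a minimal, illustrative Java sketch of that pattern. It is not the actual HBASE-17381 patch; the names (ResilientWorker, shipEdits, the BiConsumer abort hook, the retry policy) are assumptions made for the example, whereas a real implementation would call something like HBase's Abortable#abort(String, Throwable).

{noformat}
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

/**
 * Illustrative worker thread (not HBase code): recoverable exceptions are
 * logged and retried; anything else aborts the hosting server instead of
 * silently killing the thread and backing up the replication queue.
 */
public class ResilientWorker extends Thread {

  private final BiConsumer<String, Throwable> abortHook; // stand-in for an Abortable
  private volatile boolean running = true;

  public ResilientWorker(BiConsumer<String, Throwable> abortHook) {
    this.abortHook = abortHook;
    // Last-resort handler: if anything still escapes run(), abort loudly
    // rather than letting the thread die with only a log line.
    setUncaughtExceptionHandler((t, e) ->
        abortHook.accept("Uncaught exception in " + t.getName(), e));
  }

  @Override
  public void run() {
    while (running) {
      try {
        shipEdits(); // one unit of replication work (hypothetical placeholder)
      } catch (IOException e) {
        // Expected/transient failure (e.g. the WAL reader could not be opened):
        // log and retry on the next loop iteration.
        System.err.println("Recoverable failure, will retry: " + e);
        sleepQuietly(1);
      } catch (Throwable t) {
        // OutOfMemoryError and friends are not something the worker can fix;
        // abort the server so the problem is visible instead of silent.
        abortHook.accept("Unrecoverable error in replication worker", t);
        return;
      }
    }
  }

  private void shipEdits() throws IOException {
    // Placeholder for reading WAL entries and shipping them to the peer cluster.
  }

  private static void sleepQuietly(long seconds) {
    try {
      TimeUnit.SECONDS.sleep(seconds);
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
  }

  public void shutdown() {
    running = false;
  }
}
{noformat}

Usage would be along the lines of new ResilientWorker((msg, t) -> server.abort(msg, t)).start(), where server is whatever Abortable hosts the worker. The committed fix in ReplicationSource.java may draw the line between "retry" and "abort" differently; the point of the sketch is only that the decision is made explicitly instead of being left to a thread death that is merely logged.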