Date: Wed, 1 Feb 2017 23:05:51 +0000 (UTC)
From: "Gary Helmling (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions

[ https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15849077#comment-15849077 ]

Gary Helmling commented on HBASE-17381:
---------------------------------------

[~openinx] on the current patches:

* Since we are only aborting/stopping the regionserver, we can continue to use an UncaughtExceptionHandler for this purpose. We already create and attach an UncaughtExceptionHandler in ReplicationSourceWorkerThread.startup(), so that seems like the right place to fix it.
* In the case of an OOME (as checked for in your initial patch), it seems fine to use Runtime.halt(). However, that is pretty extreme in any other case.
* For other uncaught exceptions, it would be better to use Stoppable.stop(String reason). A Stoppable instance (the regionserver) is passed through to ReplicationSourceManager. We can use this instance to create a UEH that calls Stoppable.stop() when the exception we encounter is not an OOME. This will give regions a chance to close cleanly, etc., and will speed recovery.
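A minimal sketch of the handler described above. Note this is illustrative, not the actual patch: the Stoppable interface here is a stripped-down stand-in for org.apache.hadoop.hbase.Stoppable, and the class and method names are hypothetical.

```java
// Sketch of an UncaughtExceptionHandler for ReplicationSourceWorkerThread:
// halt the JVM on OOME, otherwise ask the regionserver to stop cleanly.
public class ReplicationUehSketch {

    /** Minimal stand-in for org.apache.hadoop.hbase.Stoppable. */
    interface Stoppable {
        void stop(String why);
        boolean isStopped();
    }

    /**
     * Builds a handler that halts the JVM on OutOfMemoryError (JVM state is
     * suspect, so bail out hard) and otherwise calls Stoppable.stop() so
     * regions get a chance to close cleanly and recovery is faster.
     */
    static Thread.UncaughtExceptionHandler newHandler(final Stoppable server) {
        return new Thread.UncaughtExceptionHandler() {
            @Override
            public void uncaughtException(Thread t, Throwable e) {
                if (e instanceof OutOfMemoryError) {
                    // Extreme, but appropriate when memory is exhausted.
                    Runtime.getRuntime().halt(1);
                } else {
                    server.stop("Uncaught exception in " + t.getName() + ": " + e);
                }
            }
        };
    }

    public static void main(String[] args) {
        // Demonstrate the non-OOME path with a toy Stoppable.
        final boolean[] stopped = {false};
        Stoppable server = new Stoppable() {
            public void stop(String why) { stopped[0] = true; }
            public boolean isStopped() { return stopped[0]; }
        };
        Thread.UncaughtExceptionHandler h = newHandler(server);
        // Simulate an uncaught non-OOME exception from a worker thread.
        h.uncaughtException(Thread.currentThread(), new RuntimeException("boom"));
        System.out.println("stopped=" + server.isStopped()); // prints "stopped=true"
    }
}
```

In the real code the regionserver passed to ReplicationSourceManager would play the role of `server`, and the handler would be attached in ReplicationSourceWorkerThread.startup() via Thread.setUncaughtExceptionHandler().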
> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -----------------------------------------------------------------
>
>                 Key: HBASE-17381
>                 URL: https://issues.apache.org/jira/browse/HBASE-17381
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Gary Helmling
>            Assignee: huzheng
>         Attachments: HBASE-17381.patch, HBASE-17381.v1.patch, HBASE-17381.v2.patch
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the run() method (for example, failure to allocate direct memory for the DFS client), the exception will be logged by the UncaughtExceptionHandler, but the thread will also die and the replication queue will back up indefinitely until the regionserver is restarted.
>
> We should make sure the worker thread is resilient to all exceptions that it can actually handle. For those that it really can't, it seems better to abort the regionserver rather than just allow replication to stop with minimal signal.
>
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in ReplicationSourceWorkerThread, currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
>         at java.nio.Bits.reserveMemory(Bits.java:693)
>         at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>         at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
>         at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:96)
>         at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:113)
>         at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:108)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>         at org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
>         at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
>         at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
>         at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
>         at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
>         at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
>         at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
>         at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
>         at java.io.DataInputStream.read(DataInputStream.java:100)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)