hbase-issues mailing list archives

From "huzheng (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions
Date Thu, 05 Jan 2017 14:22:58 GMT

    [ https://issues.apache.org/jira/browse/HBASE-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15801468#comment-15801468 ]

huzheng edited comment on HBASE-17381 at 1/5/17 2:22 PM:
---------------------------------------------------------

[~ghelmling] I uploaded a patch that aborts the region server if an OOME occurs. For other
exceptions, I rethrow them (the ReplicationSourceWorkerThread exits and the region server keeps
running), because it seems hard to tell whether a given case is recoverable or not.
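
For illustration only, the policy described above could look roughly like the sketch below. It is not the code in HBASE-17381.patch: `Abortable` here is a local stand-in for the region server's abort hook, and `SketchWorker` and its `body` parameter are hypothetical names made up for the example.

```java
/** Hypothetical stand-in for the region server's abort hook (assumption, not the HBase class). */
interface Abortable {
    void abort(String why, Throwable cause);
}

/** Hypothetical worker that applies the policy: abort on OOME, let everything else propagate. */
class SketchWorker implements Runnable {
    private final Abortable server;
    private final Runnable body; // one unit of replication work (assumed for the sketch)

    SketchWorker(Abortable server, Runnable body) {
        this.server = server;
        this.body = body;
    }

    @Override
    public void run() {
        try {
            body.run();
        } catch (OutOfMemoryError oome) {
            // Treated as unrecoverable inside the worker: ask the server to abort
            // so the failure is loud instead of replication silently stopping.
            server.abort("OutOfMemoryError in replication worker", oome);
        }
        // Any other Throwable is not caught here: the worker thread exits
        // and the region server keeps running.
    }
}
```

With this shape, an OOME reaches the abort hook, while something like an IllegalStateException still escapes run() and ends only the worker thread.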


was (Author: openinx):
[~ghelmling] I uploaded a patch that aborts the region server if an OOME occurs. For other
exceptions, I rethrow them because it seems hard to tell whether a given case is recoverable or not
(the ReplicationSourceWorkerThread exits and the region server keeps running).

> ReplicationSourceWorkerThread can die due to unhandled exceptions
> -----------------------------------------------------------------
>
>                 Key: HBASE-17381
>                 URL: https://issues.apache.org/jira/browse/HBASE-17381
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Gary Helmling
>         Attachments: HBASE-17381.patch
>
>
> If a ReplicationSourceWorkerThread encounters an unexpected exception in the run() method
> (for example failure to allocate direct memory for the DFS client), the exception will be
> logged by the UncaughtExceptionHandler, but the thread will also die and the replication queue
> will back up indefinitely until the Regionserver is restarted.
> We should make sure the worker thread is resilient to all exceptions that it can actually
> handle.  For those that it really can't, it seems better to abort the regionserver rather
> than just allow replication to stop with minimal signal.
> Here is a sample exception:
> {noformat}
> ERROR regionserver.ReplicationSource: Unexpected exception in ReplicationSourceWorkerThread, currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:693)
> at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:96)
> at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:113)
> at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:108)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
> at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> at org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
> at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
> at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
> at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
> at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
> at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
> at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {noformat}
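
The failure mode in the description can be reproduced in plain Java, independent of HBase. The sketch below uses hypothetical names (`DeadWorkerDemo`, `workerSurvives`, the thread name) and a simulated exception in place of the real OOME. It shows that an UncaughtExceptionHandler only observes the exception; the thread that threw it still terminates.

```java
class DeadWorkerDemo {
    /**
     * Starts a worker whose body throws, waits for it to finish, and reports
     * whether the thread is still alive after its handler has logged the error.
     */
    static boolean workerSurvives(StringBuilder log) {
        Thread worker = new Thread(() -> {
            throw new IllegalStateException("simulated unexpected failure");
        }, "sketch-replication-worker");

        // The handler runs on the dying thread; it can log, but it cannot resume run().
        worker.setUncaughtExceptionHandler((t, e) ->
            log.append("Unexpected exception in ").append(t.getName())
               .append(": ").append(e.getMessage()));

        worker.start();
        try {
            worker.join(); // also makes the handler's write to the log visible here
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", ie);
        }
        return worker.isAlive();
    }
}
```

Nothing restarts the worker after the handler returns, so any queue it was draining would keep growing unless something else notices, which is the gap the issue asks to close.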



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
