From: "Yu Li (JIRA)"
To: dev@hbase.apache.org
Date: Wed, 17 Aug 2016 03:35:21 +0000 (UTC)
Subject: [jira] [Created] (HBASE-16429) FSHLog: deadlock if rollWriter called when ring buffer filled with appends

Yu Li created HBASE-16429:
-----------------------------

             Summary: FSHLog: deadlock if rollWriter called when ring buffer filled with appends
                 Key: HBASE-16429
                 URL: https://issues.apache.org/jira/browse/HBASE-16429
             Project: HBase
          Issue Type: Bug
    Affects Versions: 1.2.2, 1.1.2
            Reporter: Yu Li
            Assignee: Yu Li
            Priority: Critical


Recently we experienced an online problem in which all handlers were stuck.
Checking the jstack we could see all handler threads waiting on RingBuffer.next, while the single ring buffer consumer was stuck waiting for {{safePointReleasedLatch}} to count down:

{noformat}
Normal handler thread:
"B.defaultRpcServer.handler=126,queue=9,port=16020" daemon prio=10 tid=0x00007efd4b44f800 nid=0x15f29 runnable [0x00007efd3db7b000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:349)
	at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136)
	at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:105)
	at com.lmax.disruptor.RingBuffer.next(RingBuffer.java:246)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.append(FSHLog.java:1222)
	at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3188)
	at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2879)
	at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2819)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:736)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:698)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2095)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32213)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:774)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:102)
	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
	at java.lang.Thread.run(Thread.java:756)

RingBufferEventHandler thread waiting for safePointReleasedLatch:
"regionserver/hadoop0369.et2.tbsite.net/11.251.152.226:16020.append-pool2-t1" prio=10 tid=0x00007efd320d0000 nid=0x1777b waiting on condition [0x00007efd2d2fa000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00007f01b48d9178> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
	at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.safePointAttained(FSHLog.java:1866)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2066)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:2029)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1909)
	at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:756)
{noformat}

{{FSHLog#replaceWriter}} will call {{SafePointZigZagLatch#releaseSafePoint}} to count down {{safePointReleasedLatch}}, but replaceWriter got stuck when trying to publish a sync onto the ring buffer:
{noformat}
"regionserver/hadoop0369.et2.tbsite.net/11.251.152.226:16020.logRoller" daemon prio=10 tid=0x00007efd320c8800 nid=0x16123 runnable [0x00007efd311f6000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:349)
	at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136)
	at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:105)
	at com.lmax.disruptor.RingBuffer.next(RingBuffer.java:246)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncOnRingBuffer(FSHLog.java:1481)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncOnRingBuffer(FSHLog.java:1477)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:957)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:726)
	at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:148)
	at java.lang.Thread.run(Thread.java:756)
{noformat}

Thus the deadlock happens. A brief summary of how the deadlock forms:

{noformat}
ring buffer filled with appends
  -> rollWriter happens
  -> the only consumer of the ring buffer waits for safePointReleasedLatch
  -> rollWriter cannot publish its sync since the ring buffer is full
  -> rollWriter never releases safePointReleasedLatch
{noformat}

This JIRA targets resolving this issue and will add a UT to cover the case.
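For anyone who wants to play with the shape of this deadlock outside HBase, below is a minimal, self-contained sketch of the same circular wait using only java.util.concurrent primitives. It is not the real FSHLog/Disruptor code path; the class name, the tiny buffer size, the sleep and the string events are made up purely for illustration. A bounded queue stands in for the filled ring buffer and a CountDownLatch stands in for {{safePointReleasedLatch}}:

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;

/**
 * Stand-alone illustration of the circular wait described above; it does NOT
 * use the real FSHLog/Disruptor classes. A tiny bounded queue plays the role
 * of the filled ring buffer, a CountDownLatch the role of safePointReleasedLatch.
 */
public class WalRollDeadlockSketch {

  public static void main(String[] args) throws InterruptedException {
    BlockingQueue<String> ringBuffer = new ArrayBlockingQueue<>(2); // "ring buffer", easy to fill
    CountDownLatch safePointReleasedLatch = new CountDownLatch(1);

    // Handler threads have already filled the buffer with appends.
    ringBuffer.put("append-1");
    ringBuffer.put("append-2");

    // Single consumer (RingBufferEventHandler): drains one event, reaches the
    // safe point, then parks on the latch before it will consume anything else.
    Thread consumer = new Thread(() -> {
      try {
        ringBuffer.take();                  // frees one slot
        safePointReleasedLatch.await();     // parks, like safePointAttained()
        while (true) {
          ringBuffer.take();                // never reached in the deadlocked run
        }
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }, "ring-buffer-consumer");

    // Log roller (replaceWriter): publishes a sync entry first and only then
    // counts the latch down, the same ordering that causes the deadlock.
    Thread logRoller = new Thread(() -> {
      try {
        ringBuffer.put("sync-marker");      // blocks: the buffer is full again
        safePointReleasedLatch.countDown(); // never reached -> circular wait
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }, "log-roller");

    consumer.setDaemon(true);
    logRoller.setDaemon(true);
    consumer.start();
    Thread.sleep(200);            // let the consumer reach the latch
    ringBuffer.put("append-3");   // another handler immediately refills the freed slot
    logRoller.start();

    logRoller.join(2000);
    // Prints "false": the roller is still stuck on the full buffer.
    System.out.println("log roller finished: " + !logRoller.isAlive());
  }
}
{code}

Run as-is it prints "log roller finished: false": the roller thread parks on the full buffer before it ever gets a chance to count the latch down, which is exactly the ordering seen in the jstack above.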