Date: Thu, 16 Jul 2015 23:39:07 +0000 (UTC)
From: "Elliott Clark (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-13971) Flushes stuck since 6 hours on a regionserver.

    [ https://issues.apache.org/jira/browse/HBASE-13971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630536#comment-14630536 ]

Elliott Clark commented on HBASE-13971:
---------------------------------------

This needs to be in branch-1 and branch-1.2 as well. This is a pretty important band aid fix until we know why creating the sequence id is stuck.

> Flushes stuck since 6 hours on a regionserver.
> ----------------------------------------------
>
>                 Key: HBASE-13971
>                 URL: https://issues.apache.org/jira/browse/HBASE-13971
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 1.3.0
>         Environment: Caused while running IntegrationTestLoadAndVerify for 20 M rows on cluster with 32 region servers each with max heap size of 24 GB.
>            Reporter: Abhilash
>            Assignee: Ted Yu
>            Priority: Critical
>         Attachments: 13971-v1.txt, 13971-v1.txt, 13971-v1.txt, jstack.1, jstack.2, jstack.3, jstack.4, jstack.5, rsDebugDump.txt, screenshot-1.png
>
>
> One region server is stuck while flushing (possible deadlock). It has been trying to flush two regions for the last 6 hours (see the screenshot).
> This happened while running IntegrationTestLoadAndVerify for 20 M rows with 600 mapper jobs and 100 back references. There have been ~37 million writes on each regionserver so far, but no writes have happened on any regionserver for the past 6 hours and their memstore size is zero (I don't know if this is related). This particular regionserver, however, has had a memstore size of 9 GB for the past 6 hours.
> Relevant snippets from the debug dump:
> Tasks:
> ===========================================================
> Task: Flushing IntegrationTestLoadAndVerify,R\x9B\x1B\xBF\xAE\x08\xD1\xA2,1435179555993.8e2d075f94ce7699f416ec4ced9873cd.
> Status: RUNNING:Preparing to flush by snapshotting stores in 8e2d075f94ce7699f416ec4ced9873cd
> Running for 22034s
> Task: Flushing IntegrationTestLoadAndVerify,\x93\xA385\x81Z\x11\xE6,1435179555993.9f8d0e01a40405b835bf6e5a22a86390.
> Status: RUNNING:Preparing to flush by snapshotting stores in 9f8d0e01a40405b835bf6e5a22a86390
> Running for 22033s
> Executors:
> ===========================================================
> ...
> Thread 139 (MemStoreFlusher.1):
>   State: WAITING
>   Blocked count: 139711
>   Waited count: 239212
>   Waiting on java.util.concurrent.CountDownLatch$Sync@b9c094a
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>     java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>     org.apache.hadoop.hbase.wal.WALKey.getSequenceId(WALKey.java:305)
>     org.apache.hadoop.hbase.regionserver.HRegion.getNextSequenceId(HRegion.java:2422)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2168)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2047)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2011)
>     org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1902)
>     org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1828)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:510)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:471)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
>     java.lang.Thread.run(Thread.java:745)
> Thread 137 (MemStoreFlusher.0):
>   State: WAITING
>   Blocked count: 138931
>   Waited count: 237448
>   Waiting on java.util.concurrent.CountDownLatch$Sync@53f41f76
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>     java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>     org.apache.hadoop.hbase.wal.WALKey.getSequenceId(WALKey.java:305)
>     org.apache.hadoop.hbase.regionserver.HRegion.getNextSequenceId(HRegion.java:2422)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2168)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2047)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2011)
>     org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1902)
>     org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1828)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:510)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:471)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
>     java.lang.Thread.run(Thread.java:745)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
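
Editorial illustration (not part of the original message): both MemStoreFlusher stacks in the quoted dump are parked in the untimed CountDownLatch.await(), which never returns if the latch is never counted down. The sketch below shows the general pattern of bounding such a wait with a timeout. It is not the patch attached to this issue; the class name SequenceIdLatchSketch, the assign() method, and the maxWaitMillis parameter are all hypothetical names chosen for illustration, and only CountDownLatch.await(long, TimeUnit) is a real JDK call.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a key whose sequence id is assigned later by another thread. */
final class SequenceIdLatchSketch {
    private final CountDownLatch assigned = new CountDownLatch(1);
    private volatile long sequenceId = -1L;

    /** Called by the (hypothetical) assigning thread once the id is known. */
    void assign(long id) {
        sequenceId = id;
        assigned.countDown();
    }

    /** Bounded wait: gives up after maxWaitMillis instead of parking indefinitely. */
    long getSequenceId(long maxWaitMillis) throws IOException {
        try {
            // await(timeout, unit) returns false if the latch was not released in time.
            if (!assigned.await(maxWaitMillis, TimeUnit.MILLISECONDS)) {
                throw new IOException(
                    "Timed out after " + maxWaitMillis + " ms waiting for sequence id");
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag and surface the interruption as an I/O failure.
            Thread.currentThread().interrupt();
            InterruptedIOException iie =
                new InterruptedIOException("Interrupted waiting for sequence id");
            iie.initCause(e);
            throw iie;
        }
        return sequenceId;
    }
}
```

With a bounded wait of this kind, a caller that never receives an id gets an exception it can log or retry on, rather than a thread that stays in WAITING for hours. The underlying question raised in the comment, why the sequence id is never assigned, is left unanswered by such a workaround.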