Subject: Is flushOneForGlobalPressure called too frequently?
From: Mickey
To: dev@hbase.apache.org
Date: Thu, 1 Aug 2013 17:08:22 +0800

Hi all,

We ran into a problem with extremely long GC pauses that brought down all of our region servers. In the region server's log I found a huge number of lines like the following:

2013-07-30 19:31:12,223 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region TEST_0,8613786870935_425284498467_53049226674,1373103980810.b5bc4c557faf21d71afc79c4e7f91f05., flushing=true, writesEnabled=true
2013-07-30 19:31:12,223 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region TEST_0,8613786870935_425284498467_53049226674,1373103980810.b5bc4c557faf21d71afc79c4e7f91f05. due to global heap pressure
2013-07-30 19:31:12,223 DEBUG org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=11.2g
2013-07-30 19:31:12,223 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region TEST_0,8613786870935_425284498467_53049226674,1373103980810.b5bc4c557faf21d71afc79c4e7f91f05., flushing=true, writesEnabled=true
2013-07-30 19:31:12,223 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region TEST_0,8613786870935_425284498467_53049226674,1373103980810.b5bc4c557faf21d71afc79c4e7f91f05. due to global heap pressure
(the same three messages repeat over and over, many times within the same millisecond)
It sometimes produces more than 1 GB of logs in one minute, and after a while the region server goes into an extremely long GC following a promotion failure.

In our environment there was one datanode with a bad network connection that kept dropping packets. The region server flushed HFiles very slowly whenever it tried to write to that datanode. After we removed that datanode the situation has been fine.

My guess: there are too many flushOneForGlobalPressure calls in this situation, and they produce too many Strings (log messages) and other objects. I checked the code of MemStoreFlusher in the HBase 0.94 branch (trunk looks the same):

  public void run() {
    while (!this.server.isStopped()) {
      FlushQueueEntry fqe = null;
      try {
        wakeupPending.set(false); // allow someone to wake us up again
        fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
        if (fqe == null || fqe instanceof WakeupFlushThread) {
          if (isAboveLowWaterMark()) {
            LOG.debug("Flush thread woke up because memory above low water="
                + StringUtils.humanReadableInt(this.globalMemStoreLimitLowMark));
            if (!flushOneForGlobalPressure()) {
              // Wasn't able to flush any region, but we're above low water mark
              // This is unlikely to happen, but might happen when closing the
              // entire server - another thread is flushing regions. We'll just
              // sleep a little bit to avoid spinning, and then pretend that
              // we flushed one, so anyone blocked will check again
              lock.lock();
              try {
                Thread.sleep(1000);
                flushOccurred.signalAll();
              } finally {
                lock.unlock();
              }
            }
            // Enqueue another one of these tokens so we'll wake up again
            wakeupFlushThread();
          }
          continue;
        }
        FlushRegionEntry fre = (FlushRegionEntry) fqe;
        if (!flushRegion(fre)) {
          break;
        }
      } catch (InterruptedException ex) {
        continue;
      } catch (ConcurrentModificationException ex) {
        continue;
      } catch (Exception ex) {
        LOG.error("Cache flusher failed for entry " + fqe, ex);
        if (!server.checkFileSystem()) {
          break;
        }
      }
    }
    ...
  }

Judging from the timestamps, flushOneForGlobalPressure() is returning almost immediately (apparently true, since the 1-second sleep above only runs when it returns false), so each iteration of the while loop finishes very quickly and the loop starts over right away.

Is my guess reasonable? If so, should we add a short sleep to the loop (a rough stand-alone sketch of what I mean is in the P.S. below)? Or is there some configuration we can set? I know that if the region server is quite busy this may delay insert operations, but in my opinion blocking inserts is more acceptable than the server going down.

Hope to get your suggestions.

Thanks,
Mickey
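P.S. To make the idea concrete, here is a rough, self-contained sketch. It is not HBase code: the class FlushLoopSpinDemo, the runLoop method, and the two stand-in methods are all made up for illustration, under the assumption that flushOneForGlobalPressure() keeps returning immediately without freeing any memory. It only shows how a short per-iteration sleep bounds how often the loop (and therefore the logging) can run.

// Hypothetical, stand-alone demo -- not HBase code. The two "stand-in"
// methods mimic the behaviour we believe we are seeing: memory stays above
// the low water mark, and every flush attempt returns right away without
// freeing anything (the target region is already flushing).
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class FlushLoopSpinDemo {

    // Stand-in for isAboveLowWaterMark(): pretend we stay above the mark.
    private static boolean aboveLowWaterMark() {
        return true;
    }

    // Stand-in for flushOneForGlobalPressure(): pretend a region was picked
    // but skipped (flushing=true), so the call returns true almost instantly.
    private static boolean flushOneForGlobalPressure(AtomicLong logLines) {
        logLines.addAndGet(2); // roughly one DEBUG + one INFO message per attempt
        return true;
    }

    // Runs the simplified flush loop for the given duration and returns how
    // many "log lines" it would have produced.
    private static long runLoop(long durationMs, long sleepPerIterationMs)
            throws InterruptedException {
        AtomicLong logLines = new AtomicLong();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(durationMs);
        while (System.nanoTime() < deadline) {
            if (aboveLowWaterMark()) {
                flushOneForGlobalPressure(logLines);
                if (sleepPerIterationMs > 0) {
                    Thread.sleep(sleepPerIterationMs); // the proposed short back-off
                }
            }
        }
        return logLines.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("no sleep:    log lines in 1s = " + runLoop(1000, 0));
        System.out.println("10 ms sleep: log lines in 1s = " + runLoop(1000, 10));
    }
}

The exact numbers will vary by machine, but the no-sleep variant iterates as fast as the loop body allows, while a 10 ms sleep caps it at roughly 100 iterations per second. Whether a fixed short sleep (or a configurable one) is the right approach is exactly what I would like your opinion on.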