Date: Sun, 2 Oct 2016 03:55:20 +0000 (UTC)
From: "Hudson (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394
archived-at: Sun, 02 Oct 2016 03:55:22 -0000

    [ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15539677#comment-15539677 ]

Hudson commented on HBASE-16721:
--------------------------------

FAILURE: Integrated in Jenkins build HBase-1.3-JDK7 #27 (See [https://builds.apache.org/job/HBase-1.3-JDK7/27/])
HBASE-16721 Concurrency issue in WAL unflushed seqId tracking - ADDENDUM (enis: rev 3ddff3b665dbf83f138f8ab19b4e79f391ff71a8)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WAL.java

> Concurrency issue in WAL unflushed seqId tracking
> -------------------------------------------------
>
>                 Key: HBASE-16721
>                 URL: https://issues.apache.org/jira/browse/HBASE-16721
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 1.0.0, 1.1.0, 1.2.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>            Priority: Critical
>             Fix For: 2.0.0, 1.3.0, 1.4.0, 1.2.4, 1.1.8
>
>         Attachments: hbase-16721_addendum.patch, hbase-16721_v1.branch-1.patch, hbase-16721_v2.branch-1.patch, hbase-16721_v2.master.patch
>
>
> I'm inspecting an interesting case where, in a production cluster, some regionservers end up accumulating hundreds of WAL files, even with force flushes going on due to max logs. This happened multiple times on that cluster, but not on other clusters. The cluster has the periodic memstore flusher disabled; however, this still does not explain why the force flush of regions due to the max-logs limit is not working. I think the periodic memstore flusher just masks the underlying problem, which is why we do not see this in other clusters.
> The problem starts like this:
> {code}
> 2016-09-21 17:49:18,272 INFO  [regionserver//10.2.0.55:16020.logRoller] wal.FSHLog: Too many wals: logs=33, maxlogs=32; forcing flush of 1 regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-21 17:49:18,273 WARN  [regionserver//10.2.0.55:16020.logRoller] regionserver.LogRoller: Failed to schedule flush of d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> then it continues until the RS is restarted:
> {code}
> 2016-09-23 17:43:49,356 INFO  [regionserver//10.2.0.55:16020.logRoller] wal.FSHLog: Too many wals: logs=721, maxlogs=32; forcing flush of 1 regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-23 17:43:49,357 WARN  [regionserver//10.2.0.55:16020.logRoller] regionserver.LogRoller: Failed to schedule flush of d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> The problem is that region {{d4cf39dc40ea79f5da4d0cf66d03cb1f}} was already split some time ago, and was able to flush its data and split without any problems. However, the FSHLog still thinks that there is some unflushed data for this region.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
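The failure mode described above can be illustrated with a minimal, self-contained sketch. The class and method names below (SeqIdTracker, onAppend, startCacheFlush, completeCacheFlush) are illustrative stand-ins, not HBase's actual sequence-id accounting API; the sketch only shows, with a deterministic interleaving, how an append that lands between the flush snapshot and the region's final close can leave a stale lowest-unflushed entry that no future flush will ever clear, pinning every subsequent WAL file:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of per-region lowest-unflushed-seqId accounting
// (not HBase's real SequenceIdAccounting class).
class SeqIdTracker {
    // region encoded name -> lowest unflushed WAL sequence id
    final ConcurrentHashMap<String, Long> lowestUnflushed = new ConcurrentHashMap<>();
    // entries snapshotted while a flush is in progress
    final ConcurrentHashMap<String, Long> flushing = new ConcurrentHashMap<>();

    // Called on every WAL append: record the seqId if none is tracked yet.
    // With monotonically increasing seqIds, putIfAbsent keeps the lowest.
    void onAppend(String region, long seqId) {
        lowestUnflushed.putIfAbsent(region, seqId);
    }

    // Flush start: move the region's entry aside so appends that arrive
    // during the flush are tracked separately for the *next* flush.
    void startCacheFlush(String region) {
        Long v = lowestUnflushed.remove(region);
        if (v != null) flushing.put(region, v);
    }

    // Flush end: drop the snapshotted entry. An append that slipped in
    // between remove() and here re-populates lowestUnflushed -- which is
    // fine *if* another flush will eventually run for this region. If the
    // region is closing for a split, no flush ever will.
    void completeCacheFlush(String region) {
        flushing.remove(region);
    }
}

public class StaleSeqIdDemo {
    public static void main(String[] args) {
        SeqIdTracker t = new SeqIdTracker();
        t.onAppend("d4cf39dc40ea79f5da4d0cf66d03cb1f", 100L);
        t.startCacheFlush("d4cf39dc40ea79f5da4d0cf66d03cb1f");
        // Racing append lands after the flush snapshot was taken, but the
        // region is about to close for the split -- no further flush runs.
        t.onAppend("d4cf39dc40ea79f5da4d0cf66d03cb1f", 101L);
        t.completeCacheFlush("d4cf39dc40ea79f5da4d0cf66d03cb1f");
        // The stale entry survives: the log roller will keep demanding a
        // flush of a region that no longer exists, and WALs accumulate.
        System.out.println(
            t.lowestUnflushed.containsKey("d4cf39dc40ea79f5da4d0cf66d03cb1f")); // prints "true"
    }
}
```

In this toy model the stale entry is exactly what the logs above show: the log roller computes "too many wals" from the lowest unflushed seqId still on record, asks for a flush of the tracked region, and the request fails ({{region=null, requester=null}}) because the region was split away long ago.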