Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Thu, 26 May 2016 00:23:12 +0000 (UTC)
From: "Enis Soztutar (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.12970093.1463422404000.296988.1464222192905@Atlassian.JIRA>
In-Reply-To: <JIRA.12970093.1463422404000@Atlassian.JIRA>
References: <JIRA.12970093.1463422404000@Atlassian.JIRA> <JIRA.12970093.1463422404699@arcas>
Subject: [jira] [Updated] (HBASE-15837) Memstore size accounting is wrong if
 postBatchMutate() throws exception
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Thu, 26 May 2016 00:23:19 -0000


     [ https://issues.apache.org/jira/browse/HBASE-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-15837:
----------------------------------
    Attachment: hbase-15837.branch-1.patch

branch-1 patch. I think we need this backported to all active branches.

> Memstore size accounting is wrong if postBatchMutate() throws exception
> -----------------------------------------------------------------------
>
>                 Key: HBASE-15837
>                 URL: https://issues.apache.org/jira/browse/HBASE-15837
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 2.0.0, 1.3.0, 1.2.2, 1.1.6
>
>         Attachments: HBASE-15837.001.patch, hbase-15837-v1.patch, hbase-15837.branch-1.patch, hbase-memstore-size-accounting.patch
>
>
> Over in PHOENIX-2883, I've been trying to track down the root cause of an issue we were seeing where a negative memstoreSize was ultimately causing an RS to abort. The tl;dr version is:
> * Something causes memstoreSize to be negative (not sure what is doing this yet)
> * All subsequent flushes short-circuit and don't run because they think there is no data to flush
> * The region is eventually closed (commonly, for a move).
> * A final flush is attempted on each store before closing (which also short-circuits for the same reason), leaving unflushed data in each store.
> * The sanity check that each store's size is zero fails and the RS aborts.
> I have a little patch which I think should improve our failure case around this, safely preventing the RS abort (by forcing a flush when memstoreSize is negative) and logging a call trace when an update to memstoreSize makes it negative (to find culprits in the future).
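
For illustration only, here is a rough sketch of the two defensive measures described in the quoted text: log a call trace whenever an update drives the counter negative, and treat a negative size as "needs a flush" rather than "nothing to flush". This is not the attached patch and not HBase's HRegion code. The class MemstoreSizeTracker and its methods are hypothetical names made up for this example, and java.util.logging stands in for whatever logging the region server uses.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for a region's memstore size accounting (not HBase code).
class MemstoreSizeTracker {
    private static final Logger LOG =
        Logger.getLogger(MemstoreSizeTracker.class.getName());

    private final AtomicLong memstoreSize = new AtomicLong(0);

    /** Applies a signed delta; logs a call trace if the total becomes negative. */
    long addAndGetMemstoreSize(long delta) {
        long newSize = memstoreSize.addAndGet(delta);
        if (newSize < 0) {
            // The exception is never thrown; it only captures the caller's stack
            // so the update that broke the accounting can be found later.
            LOG.log(Level.SEVERE,
                "memstoreSize is negative (" + newSize + ") after delta " + delta,
                new Exception("call trace"));
        }
        return newSize;
    }

    /**
     * Skips the flush only when the size is exactly zero. A negative size means
     * the accounting is off, so unflushed data may still be present.
     */
    boolean shouldFlush() {
        return memstoreSize.get() != 0;
    }

    long getMemstoreSize() {
        return memstoreSize.get();
    }

    public static void main(String[] args) {
        MemstoreSizeTracker tracker = new MemstoreSizeTracker();
        tracker.addAndGetMemstoreSize(100);   // normal write
        tracker.addAndGetMemstoreSize(-150);  // bad accounting, logs a call trace
        System.out.println("shouldFlush=" + tracker.shouldFlush());
    }
}
```

With a check of the form "size <= 0 means nothing to flush", the second update above would suppress every later flush, which is the short-circuit described in the bullets. Comparing against zero exactly is one way to avoid that; the real patch may do it differently.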

