hbase-issues mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-15837) More gracefully handle a negative memstoreSize
Date Mon, 16 May 2016 18:59:12 GMT

    [ https://issues.apache.org/jira/browse/HBASE-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285082#comment-15285082 ]

Josh Elser commented on HBASE-15837:

bq. Crashing when holding data that's unexpected seems like the correct thing to do

Without looking at the code, I would have agreed with you; however, after taking a look at
how it's written, I think it's just bad accounting. The check is written to verify that the
flush we tried to run after grabbing the writeLock actually ran successfully (i.e., there
should be no chance that any more data exists). The fact that we're using {{memstoreSize}}
as the judge of whether or not to actually run the flush, but then checking the size of
each Store, seems goofy as well (leading us to this split on the truth).
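
To make that concrete, here's roughly the shape of the close path (a condensed paraphrase, not the literal code; {{internalFlushcache}} and {{getFlushableSize}} are the real methods, the rest is simplified):

{code:java}
// Decision: the region-level counter says whether a flush is needed.
if (!abort && canFlush && this.memstoreSize.get() > 0) {
  internalFlushcache(status);
}

// Verification: the sanity check consults the per-Store accounting instead.
for (Store store : stores.values()) {
  long flushableSize = store.getFlushableSize();
  if (!(abort || flushableSize == 0 || writestate.readOnly)) {
    // The two "truths" disagree: memstoreSize claimed nothing was left,
    // but this Store still holds unflushed data. Today, this aborts the RS.
    getRegionServerServices().abort("Assertion failed while closing store " + store
        + ". flushableSize expected=0, actual=" + flushableSize, null);
  }
}
{code}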

Given that coprocessors could be loaded which could unintentionally mess things up (not to
mention internal bugs), forcing down the RS seems very invasive to me. I'll attach a patch
once I finish typing this -- let me know what you think. This feels pretty safe to me,
given that we know we're controlling all access to the region at this point in time.
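
The gist of what I'm proposing (a sketch; the actual diff is in the patch I'm attaching):

{code:java}
// Sketch: treat a negative memstoreSize as "unknown, assume dirty" rather
// than "empty", so the close-time flush still runs and the per-Store
// sanity check can pass.
long mss = this.memstoreSize.get();
if (!abort && canFlush && mss != 0) {
  if (mss < 0) {
    LOG.warn("memstoreSize was negative (" + mss + "); forcing a flush before close");
  }
  internalFlushcache(status);
}
{code}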

> More gracefully handle a negative memstoreSize
> ----------------------------------------------
>                 Key: HBASE-15837
>                 URL: https://issues.apache.org/jira/browse/HBASE-15837
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 2.0.0
> Over in PHOENIX-2883, I've been trying to figure out how to track down the root cause
> of an issue we were seeing where a negative memstoreSize was ultimately causing an RS to abort.
> The tl;dr version is:
> * Something causes memstoreSize to be negative (not sure what is doing this yet)
> * All subsequent flushes short-circuit and don't run because they think there is no data
> to flush
> * The region is eventually closed (commonly, for a move).
> * A final flush is attempted on each store before closing (which also short-circuits for
> the same reason), leaving unflushed data in each store.
> * The sanity check that each store's size is zero fails and the RS aborts.
> I have a little patch which I think should improve our failure case around this, safely
> preventing the RS abort (forcing a flush when memstoreSize is negative) and logging a call
> trace when an update to memstoreSize makes it negative (to find culprits in the future).
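
For reference, the call-trace logging described in the issue could look something like the following, in the region's accounting entry point ({{addAndGetGlobalMemstoreSize}} is the real method on {{HRegion}}; the body here is paraphrased):

{code:java}
public long addAndGetGlobalMemstoreSize(long delta) {
  long size = this.memstoreSize.addAndGet(delta);
  if (size < 0) {
    // Passing a throwable makes the logger emit a full call trace,
    // pointing at whatever drove the accounting negative.
    LOG.error("Asked to modify this region's memstoreSize to a negative value."
        + " Current=" + size + ", delta=" + delta, new Exception());
  }
  return size;
}
{code}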
