hbase-issues mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-15837) More gracefully handle a negative memstoreSize
Date Mon, 16 May 2016 19:02:12 GMT

     [ https://issues.apache.org/jira/browse/HBASE-15837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Elser updated HBASE-15837:
-------------------------------
    Attachment: HBASE-15837.001.patch

.001 A first stab at avoiding the RS crash. The general goals are to

# Determine who screwed up the memstoreSize in the first place
# Avoid data loss when memstoreSize is wrong

If a store does fail to flush successfully, the RS should still crash. The patch only changes
the flush logic so that a negative memstoreSize no longer prevents a Store's flush and thereby
causes the RS abort.

> More gracefully handle a negative memstoreSize
> ----------------------------------------------
>
>                 Key: HBASE-15837
>                 URL: https://issues.apache.org/jira/browse/HBASE-15837
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 2.0.0
>
>         Attachments: HBASE-15837.001.patch
>
>
> Over in PHOENIX-2883, I've been trying to track down the root cause of an issue we were seeing where a negative memstoreSize was ultimately causing an RS to abort. The tl;dr version is
> * Something causes memstoreSize to become negative (not sure what is doing this yet)
> * All subsequent flushes short-circuit and don't run because they think there is no data to flush
> * The region is eventually closed (commonly, for a move).
> * A final flush is attempted on each store before closing (which also short-circuits for the same reason), leaving unflushed data in each store.
> * The sanity check that each store's size is zero fails and the RS aborts.
> I have a little patch which I think should improve our failure case around this, preventing the RS abort safely (forcing a flush when memstoreSize is negative) and logging a call trace when an update to memstoreSize makes it negative (to find culprits in the future).
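
For illustration only, here is a minimal Java sketch of the two ideas in the description: logging a call trace when an update drives the counter negative, and treating a negative size as "flush anyway" rather than "nothing to flush". This is not the contents of HBASE-15837.001.patch. The class MemstoreAccounting and its methods are hypothetical names, and the old check is assumed to be of the form "size > 0 means there is data to flush".

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for a region's memstore size accounting.
// It is NOT HBase's actual region code; names are made up for this sketch.
class MemstoreAccounting {
  private static final Logger LOG =
      Logger.getLogger(MemstoreAccounting.class.getName());

  private final AtomicLong memstoreSize = new AtomicLong(0);

  /** Applies a size delta and logs a call trace if the result is negative. */
  long addAndGetMemstoreSize(long delta) {
    long newSize = memstoreSize.addAndGet(delta);
    if (newSize < 0) {
      // The exception is never thrown; it only captures the caller's stack
      // so the culprit of the bad update can be found in the logs.
      LOG.log(Level.SEVERE,
          "memstoreSize is negative after update: newSize=" + newSize
              + ", delta=" + delta,
          new Exception("call trace"));
    }
    return newSize;
  }

  /** Assumed shape of the old check: non-positive size means "no data". */
  boolean shouldFlushOld() {
    return memstoreSize.get() > 0;
  }

  /**
   * Sketch of the more graceful check: only an exact zero is treated as
   * empty. A negative size means the accounting is wrong, so flush anyway
   * rather than risk leaving data in the stores.
   */
  boolean shouldFlush() {
    return memstoreSize.get() != 0;
  }
}
```

With this sketch, adding 100 and then subtracting 250 leaves the counter at -150: shouldFlushOld() skips the flush, while shouldFlush() still requests one and the bad update is logged with its call trace.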



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
