Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Date: Mon, 21 Oct 2013 17:54:42 +0000 (UTC)
From: "Himanshu Vashishtha (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Message-ID: <JIRA.12674730.1382324104252.98590.1382378082341@arcas>
In-Reply-To: <JIRA.12674730.1382324104252@arcas>
References: <JIRA.12674730.1382324104252@arcas>
Subject: [jira] [Commented] (HBASE-9810) Global memstore size will be
 calculated wrongly if replaying recovered edits throws exception
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HBASE-9810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800876#comment-13800876 ] 

Himanshu Vashishtha commented on HBASE-9810:
--------------------------------------------

+1 to the patch.

I don't think we should call clearRegionReplayEditsSize() when an erroneous file is skipped with skip_errors = true.
Ideally, the per-region replay size should be cleared only once: either after all the recovered edits files have been processed, or, if an error is thrown and the region cannot be opened, via rollbackRegionReplayEditsSize(). If we also clear it in the condition you mentioned, the global memstore size would again be accounted inaccurately whenever we need to roll back (for whatever reason), because we would already have cleared the accounting for all the previously replayed clean recovered edits files. With skip_errors set to true, we would call it after replaying all the recovered edits files anyway. Please correct me if this is not so. Thanks.
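
To illustrate the accounting concern, here is a minimal Java sketch. It is not the HBase code: the class ReplayAccountingSketch, the String region key, addReplayEditsSize() and leakedAfterFailedOpen() are hypothetical stand-ins. Only the names clearRegionReplayEditsSize and rollbackRegionReplayEditsSize come from this issue, and their bodies here are an assumption about the intended behavior (clear forgets the per-region counter, rollback subtracts it from the global size).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

/** Simplified, hypothetical stand-in for the accounting discussed in this issue. */
class ReplayAccountingSketch {
    private final AtomicLong globalMemstoreSize = new AtomicLong();
    // Bytes added to the global size while replaying recovered edits, per region.
    private final ConcurrentMap<String, AtomicLong> replayEditsPerRegion =
            new ConcurrentHashMap<>();

    long getGlobalMemstoreSize() {
        return globalMemstoreSize.get();
    }

    /** Records bytes that a replayed edit added to the memstore (hypothetical helper). */
    void addReplayEditsSize(String region, long size) {
        globalMemstoreSize.addAndGet(size);
        replayEditsPerRegion.computeIfAbsent(region, r -> new AtomicLong()).addAndGet(size);
    }

    /** Region failed to open: subtract whatever is still tracked for it. */
    void rollbackRegionReplayEditsSize(String region) {
        AtomicLong replayed = replayEditsPerRegion.remove(region);
        if (replayed != null) {
            globalMemstoreSize.addAndGet(-replayed.get());
        }
    }

    /** Region opened: forget the per-region counter, keep the global size. */
    void clearRegionReplayEditsSize(String region) {
        replayEditsPerRegion.remove(region);
    }

    /**
     * Replays some clean files, then a file that fails part-way, then rolls back.
     * Returns the global size left behind for a region that never opened.
     */
    static long leakedAfterFailedOpen(long[] cleanFileSizes, long failingFileSize,
                                      boolean clearPerFile) {
        ReplayAccountingSketch acc = new ReplayAccountingSketch();
        String region = "region-a"; // hypothetical region name
        for (long size : cleanFileSizes) {
            acc.addReplayEditsSize(region, size);
            if (clearPerFile) {
                // Clearing after each file discards what rollback would need.
                acc.clearRegionReplayEditsSize(region);
            }
        }
        // Edits applied from the bad file before the exception is thrown.
        acc.addReplayEditsSize(region, failingFileSize);
        acc.rollbackRegionReplayEditsSize(region);
        return acc.getGlobalMemstoreSize();
    }
}
```

In this sketch, leakedAfterFailedOpen(new long[]{100, 200}, 50, true) returns 300, because the rollback only knows about the last 50 bytes, while the same call with clearPerFile = false returns 0. Clearing early for a skipped file would reintroduce the same gap for any later rollback.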

> Global memstore size will be calculated wrongly if replaying recovered edits throws exception
> ---------------------------------------------------------------------------------------------
>
>                 Key: HBASE-9810
>                 URL: https://issues.apache.org/jira/browse/HBASE-9810
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.0, 0.96.1
>            Reporter: chunhui shen
>            Assignee: chunhui shen
>            Priority: Critical
>         Attachments: hbase-9810-trunk.patch
>
>
> Recently we encountered the following case on version 0.94:
> Flushes were triggered frequently because:
> {noformat}DEBUG org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=14.4g
> {noformat}
> But the real global memstore size was only about 1g.
> It seems the global memstore size had been calculated wrongly.
> In the logs, I found the following entry, which points to the root cause:
> {noformat}
> ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=notifysub2_index,\x83\xDC^\xCD\xA3\x8A<\xE2\x8E\xE6\xAD!\xDC\xE8t\xED,1379148697072.46be7c2d71c555379278a7494df3015e., starting to roll back the global memstore size.
> java.lang.NegativeArraySizeException
>         at org.apache.hadoop.hbase.KeyValue.getFamily(KeyValue.java:1096)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:2933)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:2811)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:583)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:499)
>         at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3939)
>         at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3887)
>         at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:332)
>         at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> {noformat}
> Browsing this part of the code, there seems to be a critical bug in the global memstore size accounting when replaying recovered edits.
> (RegionServerAccounting#clearRegionReplayEditsSize is called for each edits file, which means the size rolled back by RegionServerAccounting#rollbackRegionReplayEditsSize is smaller than the size actually added.)
> Anyway, the fix is simple; see the patch.

