hbase-issues mailing list archives

From "Himanshu Vashishtha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9810) Global memstore size will be calculated wrongly if replaying recovered edits throws exception
Date Mon, 21 Oct 2013 17:54:42 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800876#comment-13800876 ]

Himanshu Vashishtha commented on HBASE-9810:
--------------------------------------------

+1 to the patch.

I don't think we should call clearRegionReplayEditsSize() when an erroneous file is skipped
with skip_errors = true.
Ideally, this call should be made only once all the recovered edits files have been processed;
if an error is thrown and the region can't be opened, the size is dropped via
rollbackRegionReplayEditsSize() instead.
If we call it in the condition you mentioned, the global memstore size accounting would again
be inaccurate whenever we need to roll back (for whatever reason), because we would have
cleared the accounting for all the previous clean recovered edits files. With skip_errors
set to true, we would call it after recovering all the recovered edits files anyway. Please
correct me if that is not so. Thanks.
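
To make the accounting concern concrete, here is a minimal sketch in Java. It is a simplified model written for illustration, not the real RegionServerAccounting class: the class name ReplayAccountingSketch, the method recordReplayedEdits, the helper leftoverAfterFailedOpen, and the byte sizes are all hypothetical, and the model assumes that replayed edits are added to both the global counter and a per-region counter. Only clearRegionReplayEditsSize and rollbackRegionReplayEditsSize are names taken from this thread.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical, simplified model of replay-edits accounting (not HBase code). */
class ReplayAccountingSketch {
    private final AtomicLong globalMemstoreSize = new AtomicLong();
    // Bytes replayed per region that have not yet been cleared or rolled back.
    private final ConcurrentMap<String, AtomicLong> replayEditsPerRegion =
            new ConcurrentHashMap<>();

    long getGlobalMemstoreSize() {
        return globalMemstoreSize.get();
    }

    /** Assumed behaviour: replayed edits count toward both counters. */
    void recordReplayedEdits(String region, long bytes) {
        globalMemstoreSize.addAndGet(bytes);
        replayEditsPerRegion.computeIfAbsent(region, r -> new AtomicLong())
                .addAndGet(bytes);
    }

    /** Region failed to open: subtract what is still tracked for it. */
    void rollbackRegionReplayEditsSize(String region) {
        AtomicLong replayed = replayEditsPerRegion.remove(region);
        if (replayed != null) {
            globalMemstoreSize.addAndGet(-replayed.get());
        }
    }

    /** Forget the per-region counter; the global counter is left untouched. */
    void clearRegionReplayEditsSize(String region) {
        replayEditsPerRegion.remove(region);
    }

    /**
     * Replays two clean files, then a third that fails part-way and forces a
     * rollback. Returns the bytes left in the global counter for a region
     * that never opened.
     */
    static long leftoverAfterFailedOpen(boolean clearAfterEachFile) {
        ReplayAccountingSketch acct = new ReplayAccountingSketch();
        String region = "region-1";
        long[] cleanFileSizes = {100L, 200L}; // made-up sizes
        for (long size : cleanFileSizes) {
            acct.recordReplayedEdits(region, size);
            if (clearAfterEachFile) {
                acct.clearRegionReplayEditsSize(region);
            }
        }
        acct.recordReplayedEdits(region, 50L);      // partial replay before the error
        acct.rollbackRegionReplayEditsSize(region); // open fails, roll back
        return acct.getGlobalMemstoreSize();
    }

    public static void main(String[] args) {
        System.out.println("clear per file:  " + leftoverAfterFailedOpen(true));
        System.out.println("clear once only: " + leftoverAfterFailedOpen(false));
    }
}
```

In this model, clearing after every file leaves the bytes of the two clean files in the global counter after the rollback, while deferring the clear until all files are done lets the rollback remove everything. The same reasoning applies to clearing when a bad file is skipped: it would discard the tracking for the clean files replayed before it.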

> Global memstore size will be calculated wrongly if replaying recovered edits throws exception
> ---------------------------------------------------------------------------------------------
>
>                 Key: HBASE-9810
>                 URL: https://issues.apache.org/jira/browse/HBASE-9810
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.0, 0.96.1
>            Reporter: chunhui shen
>            Assignee: chunhui shen
>            Priority: Critical
>         Attachments: hbase-9810-trunk.patch
>
>
> Recently we encountered such a case in 0.94-version:
> Flush is triggered frequently because:
> {noformat}DEBUG org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=14.4g
> {noformat}
> But, the real global memstore size is about 1g.
> It seems the global memstore size has been calculated wrongly.
> Through the logs, I find the following root cause log:
> {noformat}
> ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=notifysub2_index,\x83\xDC^\xCD\xA3\x8A<\xE2\x8E\xE6\xAD!\xDC\xE8t\xED,1379148697072.46be7c2d71c555379278a7494df3015e., starting to roll back the global memstore size.
> java.lang.NegativeArraySizeException
>         at org.apache.hadoop.hbase.KeyValue.getFamily(KeyValue.java:1096)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:2933)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:2811)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:583)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:499)
>         at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3939)
>         at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3887)
>         at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:332)
>         at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> {noformat}
> Browsing this part of the code, there seems to be a critical bug in the global memstore
> size accounting when replaying recovered edits.
> (RegionServerAccounting#clearRegionReplayEditsSize is called for each edits file, which
> means the size rolled back by RegionServerAccounting#rollbackRegionReplayEditsSize is
> smaller than the size actually added.)
> Anyway, the solution is easy; see the patch.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
