hbase-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9737) Corrupt HFile cause resource leak leading to Region Server OOM
Date Wed, 23 Oct 2013 01:47:41 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802515#comment-13802515 ]

Hudson commented on HBASE-9737:

FAILURE: Integrated in HBase-0.94 #1180 (See [https://builds.apache.org/job/HBase-0.94/1180/])
HBASE-9737 Corrupt HFile cause resource leak leading to Region Server OOM (stack: rev 1534855)
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java

> Corrupt HFile cause resource leak leading to Region Server OOM
> --------------------------------------------------------------
>                 Key: HBASE-9737
>                 URL: https://issues.apache.org/jira/browse/HBASE-9737
>             Project: HBase
>          Issue Type: Bug
>          Components: HFile
>    Affects Versions: 0.94.12
>            Reporter: Aditya Kishore
>            Assignee: Aditya Kishore
>            Priority: Critical
>             Fix For: 0.98.0, 0.94.13, 0.96.1
>         Attachments: 9737.096.txt, HBASE-9737_0.94.patch, HBASE-9737_0.94.patch, HBASE-9737_0.94.patch,
HBASE-9737.patch, HBASE-9737.patch
> One of our customers was recently hit with OOM errors on almost all of the region servers.
> A postmortem of the issue revealed that a corrupt HFile had made its way into one of the
regions, which resulted in the region being brought offline immediately, as per design.
> What happened next reveals two issues:
> * As soon as the region was offlined, the Master noticed and tried to assign the region
to another region server, which of course failed (again due to the corrupt HFile), and then
the Master tried to assign it to yet another, and so on. The region kept bouncing from one server
to another; this went unnoticed for a few hours, and all region server logs were filled with
thousands of this message:{noformat}org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Failed open of
> region=userdata,50743646010,1378139055806.318c533716869574f10615703269497f.,
> starting to roll back the global memstore size.
> java.io.IOException: java.io.IOException:
> org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile
> Trailer from file
> /hbase/userdata/318c533716869574f10615703269497f/data/a3e2ae39f71441ac92a6563479fb976e
>         at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:550)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:463)
>         at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3835)
>         at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3783)
>         at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:332)
>         at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException:
> org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile
> Trailer from file
> /hbase/userdata/318c533716869574f10615703269497f/data/a3e2ae39f71441ac92a6563479fb976e
>         at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:404)
>         at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:257)
>         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:3017)
>         at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:525)
>         at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:523)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> {noformat} For a situation like this, the region should be marked "offlined_with_error"
or something similar so that the Master does not try to assign it to another server without the user
fixing the issue. I will create a separate JIRA for that.
> * The second problem, and the scope of this JIRA, is that the function {{org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion()}}
throws an exception without closing the {{FSDataInputStream}} objects even if closeIStream is
set to true. This led to orphaned filesystem streams accumulating in the region server, which eventually
died of OOM.
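
For readers unfamiliar with this kind of leak, the second problem can be illustrated with a minimal, self-contained sketch. This is not the committed patch and does not use HBase classes: TrailerReaderSketch, TrackingStream, pickVersion and the one-byte "trailer" are all hypothetical stand-ins; only the closeIStream flag name is taken from the description above. The sketch shows one way to make sure a stream is closed when the version check throws.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class TrailerReaderSketch {

    /** Hypothetical test stream that records whether close() was called. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /**
     * Reads a one-byte "version" from the stream (a stand-in for a real trailer).
     * If reading or validation fails and closeIStream is true, the stream is
     * closed before the exception propagates, so it cannot be orphaned.
     * On success the stream is left open for the caller to use.
     */
    static int pickVersion(InputStream in, boolean closeIStream) throws IOException {
        try {
            int version = in.read();
            if (version < 1 || version > 2) {
                throw new IOException("Problem reading trailer: bad version " + version);
            }
            return version;
        } catch (IOException | RuntimeException e) {
            if (closeIStream) {
                try {
                    in.close();
                } catch (IOException closeError) {
                    // Keep the original failure; attach the close failure to it.
                    e.addSuppressed(closeError);
                }
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        TrackingStream corrupt = new TrackingStream(new byte[] {9});
        try {
            pickVersion(corrupt, true);
        } catch (IOException e) {
            System.out.println("open failed: " + e.getMessage());
        }
        System.out.println("stream closed after failure: " + corrupt.closed);
    }
}
```

In this sketch the cleanup sits in the catch block rather than in a finally block or try-with-resources, because on success the stream has to stay open for whoever goes on to read the file; only the failure path releases it.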

This message was sent by Atlassian JIRA
