hbase-issues mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HBASE-3874) ServerShutdownHandler fails on NPE if a plan has a random region assignment
Date Thu, 19 May 2011 00:25:47 GMT

     [ https://issues.apache.org/jira/browse/HBASE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-3874.
---------------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

Committed to branch and trunk, thanks for the review Stack!

> ServerShutdownHandler fails on NPE if a plan has a random region assignment
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-3874
>                 URL: https://issues.apache.org/jira/browse/HBASE-3874
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3874-trunk.patch, HBASE-3874.patch
>
>
> By chance, we managed to revert the ulimit on one of our clusters to 1024 and it started dying non-stop on "Too many open files". The bad part is that some region servers weren't completely processed by ServerShutdownHandler, because it failed on:
> {quote}
> 2011-05-07 00:04:46,203 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
> java.lang.NullPointerException
> 	at org.apache.hadoop.hbase.master.AssignmentManager.processServerShutdown(AssignmentManager.java:1804)
> 	at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:101)
> 	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> 	at java.lang.Thread.run(Thread.java:662)
> {quote}
> Reading the code, it seems the NPE is in the if statement:
> {code}
> Map.Entry<String, RegionPlan> e = i.next();
> if (e.getValue().getDestination().equals(hsi)) {
>   // Use iterator's remove else we'll get CME
>   i.remove();
> }
> {code}
> This means that the destination (HSI) is null. Looking through the code, it seems we instantiate a RegionPlan with a null HSI when it's a random assignment.
> So if a random assignment is in progress when a node dies, this issue can happen.
> Initially I thought that this could mean data loss, but the logs are already split so it's just the reassignment that doesn't happen (still bad).
> It also left the master with a dead server still marked as being processed, so for two days the balancer didn't run, failing on:
> bq. org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): []
> The array is empty because we are running 0.90.3, which removes the RS from the dead list if it comes back.
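
For illustration only, here is a minimal, self-contained sketch of a null-safe version of the check quoted above. It is not the committed patch (the attached HBASE-3874.patch is not reproduced in this message), and the class RegionPlanCleanup, the nested Plan type, and the use of a plain String for the server are all hypothetical stand-ins for RegionPlan and the HSI. The only idea it shows is that a plan with a null destination (a random assignment) should be skipped rather than dereferenced.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class RegionPlanCleanup {

    /** Hypothetical stand-in for a region plan; destination is null for a random assignment. */
    static final class Plan {
        private final String destination;

        Plan(String destination) {
            this.destination = destination;
        }

        String getDestination() {
            return destination;
        }
    }

    /**
     * Removes every plan whose destination is the dead server and returns how many were removed.
     * Plans with a null destination are left in place instead of triggering an NPE.
     */
    static int removePlansFor(Map<String, Plan> plans, String deadServer) {
        int removed = 0;
        for (Iterator<Map.Entry<String, Plan>> i = plans.entrySet().iterator(); i.hasNext();) {
            Map.Entry<String, Plan> e = i.next();
            String destination = e.getValue().getDestination();
            // Null check first: a random assignment has no destination yet.
            if (destination != null && destination.equals(deadServer)) {
                // Use the iterator's remove to avoid a ConcurrentModificationException.
                i.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, Plan> plans = new HashMap<>();
        plans.put("region-a", new Plan("rs1"));
        plans.put("region-b", new Plan(null)); // random assignment
        plans.put("region-c", new Plan("rs2"));

        int removed = removePlansFor(plans, "rs1");
        System.out.println("removed=" + removed + " remaining=" + plans.keySet());
    }
}
```

Calling equals on the dead server instead (deadServer.equals(destination)) would avoid the NPE just as well, provided the dead server reference itself is never null.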

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
