hbase-issues mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-3874) ServerShutdownHandler fails on NPE if a plan has a random region assignment
Date Thu, 19 May 2011 00:23:47 GMT

     [ https://issues.apache.org/jira/browse/HBASE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Jean-Daniel Cryans updated HBASE-3874:

    Attachment: HBASE-3874-trunk.patch

Patch for trunk.

> ServerShutdownHandler fails on NPE if a plan has a random region assignment
> ---------------------------------------------------------------------------
>                 Key: HBASE-3874
>                 URL: https://issues.apache.org/jira/browse/HBASE-3874
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>         Attachments: HBASE-3874-trunk.patch, HBASE-3874.patch
> By accident, the ulimit on one of our clusters got reverted to 1024 and it started dying non-stop on "Too many open files". The bad part is that some region servers weren't completely processed by ServerShutdownHandler because they failed on:
> {quote}
> 2011-05-07 00:04:46,203 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
> java.lang.NullPointerException
> 	at org.apache.hadoop.hbase.master.AssignmentManager.processServerShutdown(AssignmentManager.java:1804)
> 	at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:101)
> 	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> 	at java.lang.Thread.run(Thread.java:662)
> {quote}
> Reading the code, it seems the NPE is in the if statement:
> {code}
> Map.Entry<String, RegionPlan> e = i.next();
> if (e.getValue().getDestination().equals(hsi)) {
>   // Use iterator's remove else we'll get CME
>   i.remove();
> }
> {code}
> This means that the destination (HSI) is null. Looking through the code, it seems we instantiate a RegionPlan with a null HSI when it's a random assignment.
> So if a random assignment is in progress while a node dies, this issue can happen.
> Initially I thought that this could mean data loss, but the logs are already split so it's just the reassignment that doesn't happen (still bad).
> It also left the master with a dead server still marked as being processed, so for two days the balancer didn't run, failing on:
> bq. org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): []
> The array is empty because we are running 0.90.3, which removes the RS from the dead list if it comes back.
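
For illustration only, here is a minimal sketch of how the loop quoted above could tolerate a plan with no destination. It is not taken from the attached patches and may differ from what they do. `PlanCleanupSketch`, `Plan`, and `removePlansFor` are hypothetical names, and a plain `String` stands in for the server info object (`hsi`); the only idea borrowed from the report is that a random assignment leaves the destination null.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not HBase code: a null-safe version of the loop from the report.
final class PlanCleanupSketch {

    // Stand-in for RegionPlan. A null destination models a random assignment.
    static final class Plan {
        private final String destination;

        Plan(String destination) {
            this.destination = destination;
        }

        String getDestination() {
            return destination;
        }
    }

    // Removes every plan that targets deadServer and returns how many were removed.
    // Plans with a null destination are left in place instead of throwing an NPE.
    static int removePlansFor(Map<String, Plan> plans, String deadServer) {
        int removed = 0;
        Iterator<Map.Entry<String, Plan>> i = plans.entrySet().iterator();
        while (i.hasNext()) {
            Map.Entry<String, Plan> e = i.next();
            String destination = e.getValue().getDestination();
            if (destination != null && destination.equals(deadServer)) {
                // Use the iterator's remove, otherwise we get a ConcurrentModificationException
                i.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, Plan> plans = new TreeMap<>();
        plans.put("regionA", new Plan("rs1"));
        plans.put("regionB", new Plan(null)); // random assignment, no destination yet
        plans.put("regionC", new Plan("rs2"));

        int removed = removePlansFor(plans, "rs1");
        System.out.println("removed=" + removed + " remaining=" + plans.keySet());
    }
}
```

The sketch leaves null-destination plans untouched on the assumption that a random assignment cannot be targeting the dead server; whether that is the right behaviour for the real AssignmentManager is a separate question.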

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
