hbase-issues mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-3809) .META. may not come back online if > number of executors servers crash and one of those > number of executors was carrying meta
Date Fri, 22 Apr 2011 20:50:05 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13023392#comment-13023392 ]

Jean-Daniel Cryans commented on HBASE-3809:
-------------------------------------------

In the same vein (having to rely on .META. for region server shutdown), we saw an issue yesterday
where the balancer started just before a region server was cleanly shut down. In sequence:

 - Balancer starts unassigning regions
 - RS starts closing a few regions for balancing
 - RS is told to stop
 - Master initiates the region server shutdown handler, which scans .META. for regions that
are on that region server
 - Regions are being unassigned and moved by the balancer while the master force-unassigns the
regions that (it thinks) are on the RS
 - At the end, 25 out of 500 regions are double assigned because the balancer had already
reassigned them by the time the server shutdown handler reassigned them.

This happens because the master relies on potentially stale information from its .META. scan when
forcing the unassign. According to the comments in the code, we still have to scan to check
against splits. The workaround is to disable the balancer before shutting down a region server
(as rolling restart does).
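
To make the sequence above concrete, here is a minimal sketch of the race as a toy Java model.
Everything in it is an assumption for illustration: the class, the method, the region and server
names, and the 10-region table are hypothetical and are not HBase APIs. It only shows why a
reassignment driven by an earlier scan double-assigns whatever the balancer moved in between,
and why turning the balancer off first avoids it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: none of these names are HBase classes or methods.
public class StaleScanDemo {

    /** Returns the regions that were assigned more than once during the shutdown. */
    static Set<String> doubleAssigned(boolean balancerEnabled) {
        // Stand-in for .META.: region name -> hosting server.
        Map<String, String> meta = new TreeMap<>();
        for (int i = 0; i < 10; i++) {
            meta.put("region-" + i, "rs1");
        }
        // How many times each region was assigned in this episode.
        Map<String, Integer> assignments = new TreeMap<>();

        // 1. The shutdown handler scans the table for regions on rs1.
        List<String> scanned = new ArrayList<>();
        for (Map.Entry<String, String> e : meta.entrySet()) {
            if ("rs1".equals(e.getValue())) {
                scanned.add(e.getKey());
            }
        }

        // 2. Meanwhile the balancer finishes moving two regions elsewhere.
        if (balancerEnabled) {
            for (String region : Arrays.asList("region-3", "region-7")) {
                meta.put(region, "rs2");
                assignments.merge(region, 1, Integer::sum);
            }
        }

        // 3. The handler reassigns everything from its now stale scan.
        for (String region : scanned) {
            meta.put(region, "rs3");
            assignments.merge(region, 1, Integer::sum);
        }

        Set<String> doubled = new TreeSet<>();
        for (Map.Entry<String, Integer> e : assignments.entrySet()) {
            if (e.getValue() > 1) {
                doubled.add(e.getKey());
            }
        }
        return doubled;
    }

    public static void main(String[] args) {
        System.out.println(doubleAssigned(true));   // [region-3, region-7]
        System.out.println(doubleAssigned(false));  // []
    }
}
```

With the balancer disabled, step 2 never runs, so the scan is still accurate when the handler
acts on it. That is all the workaround buys; it does not make the scan itself any fresher.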

hbck fixed the double assignment without any trouble.

> .META. may not come back online if > number of executors servers crash and one of
those > number of executors was carrying meta
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3809
>                 URL: https://issues.apache.org/jira/browse/HBASE-3809
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Priority: Critical
>             Fix For: 0.92.0
>
>
> This is a duplicate of another issue, but at the moment I cannot find the original.
> If you had a 700-node cluster and ran something on it that killed 100 nodes, and .META.
> had been running on one of those downed nodes, all of your master executors would be
> processing ServerShutdowns, and more than likely none of the currently processing
> executors would be servicing the shutdown of the server that was carrying .META.
> For server shutdown to complete at the moment, an online .META. is required. So, in the
> above case, we'll be stuck: the current executors cannot clear to make space for processing
> the server that carried .META., because they need .META. to complete.
> We can make the master handlers unbounded so the pool expands to accommodate all crashed
> servers (and so reaches the one carrying .META. in its queue), or we can change shutdown
> handling so it doesn't require .META. to be online (it's used to figure out the regions the
> server was carrying); we could use the master's in-memory picture of the cluster (but IIRC,
> there may be holes ... TBD).
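
The starvation described in the issue can be reproduced with a plain java.util.concurrent thread
pool. The sketch below is an illustration under assumptions, not the master's executor code: the
class and method names, the latch standing in for ".META. is online", and the 500 ms timeout are
all hypothetical. Handlers for ordinary servers block until .META. is online, and the handler for
the server that carried .META. is the only one that can bring it online, but it is queued behind
them.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model only: none of these names are HBase classes or methods.
public class MetaStarvationDemo {

    /** Returns true if every shutdown handler finished before the timeout. */
    static boolean allShutdownsComplete(int poolSize, int crashedServers) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch metaOnline = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(crashedServers);

        // Handlers for ordinary servers: each needs .META. online to finish.
        for (int i = 0; i < crashedServers - 1; i++) {
            pool.submit(() -> {
                try {
                    metaOnline.await();
                    done.countDown();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Handler for the server that carried .META., queued last.
        pool.submit(() -> {
            metaOnline.countDown();  // .META. reassigned and back online
            done.countDown();
        });

        boolean finished;
        try {
            finished = done.await(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            finished = false;
        }
        pool.shutdownNow();  // interrupt any handlers that are still stuck
        return finished;
    }

    public static void main(String[] args) {
        System.out.println(allShutdownsComplete(2, 5));  // false: both workers wait on .META.
        System.out.println(allShutdownsComplete(5, 5));  // true: the .META. handler gets a thread
    }
}
```

A pool as large as the number of crashed servers stands in here for the "no bound" option. The
other option, not needing .META. to finish a shutdown, would correspond to removing the
metaOnline.await() call from the ordinary handlers.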

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
