hbase-issues mailing list archives

From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-13330) Region left unassigned due to AM & SSH each thinking the assignment would be done by the other
Date Fri, 10 Apr 2015 22:16:12 GMT

     [ https://issues.apache.org/jira/browse/HBASE-13330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-13330:
    Attachment: 13330-branch-1.txt

The meta had the location of the region 0d6cf37c18c54c6f4744750c6a7be837 as dnj1-bcpc-r3n8.example.com,60020,1425598187703
(which was dead). However, there was an RIT node in the RS_ZK_REGION_FAILED_OPEN state
with dnj1-bcpc-r3n2.example.com,60020,1425603618259 as the servername (which was alive), and
hence the SSH for dnj1-bcpc-r3n8.example.com,60020,1425598187703 wouldn't recover this region
(as designed).
The region stays offline from then on; the next activity on the region happens only when
dnj1-bcpc-r3n2.example.com,60020,1425603618259 expires. At that time, the SSH for
dnj1-bcpc-r3n2.example.com,60020,1425603618259 finds the region in an unexpected state, "CLOSED",
and the region stays unassigned forever.
The patch attached makes it so that, at startup, for nodes in the RS_ZK_REGION_FAILED_OPEN state,
the servername from the RIT node is set as the last assigned server (overwriting the servername
obtained from the meta for the region in question). With this change, the attempted assignment
should go through, since the region is no longer considered as belonging to the dead server
(step (6) in the description of the ticket).
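
To make the effect of the patch concrete, here is a minimal Python sketch. It is a toy model
under stated assumptions, not HBase code: the function name last_assigned_server, its
parameters, and the dead_not_processed set are hypothetical. Only the state name
RS_ZK_REGION_FAILED_OPEN and the two server names are taken from this issue.

```python
# Toy model of the startup decision; not actual HBase code.
# The function name, parameters, and set below are hypothetical.
from typing import Optional

FAILED_OPEN = "RS_ZK_REGION_FAILED_OPEN"

def last_assigned_server(meta_server: str,
                         rit_state: Optional[str],
                         rit_server: Optional[str],
                         patched: bool = True) -> str:
    """Pick the server the master records as the region's last host."""
    if patched and rit_state == FAILED_OPEN and rit_server is not None:
        # Patched behaviour: trust the RIT node over the stale meta entry.
        return rit_server
    return meta_server

R3N8 = "dnj1-bcpc-r3n8.example.com,60020,1425598187703"  # dead, from meta
R3N2 = "dnj1-bcpc-r3n2.example.com,60020,1425603618259"  # alive, from RIT node
dead_not_processed = {R3N8}

before = last_assigned_server(R3N8, FAILED_OPEN, R3N2, patched=False)
after = last_assigned_server(R3N8, FAILED_OPEN, R3N2, patched=True)

# In this model, assignment is skipped only while the last host is dead
# and its shutdown handler has not run yet.
print(before in dead_not_processed)  # True: assignment would be skipped
print(after in dead_not_processed)   # False: assignment can proceed
```

For any other RIT state the sketch falls back to the server from the meta, which matches the
scope of the patch as described (only RS_ZK_REGION_FAILED_OPEN nodes are affected).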

> Region left unassigned due to AM & SSH each thinking the assignment would be done by the other
> ----------------------------------------------------------------------------------------------
>                 Key: HBASE-13330
>                 URL: https://issues.apache.org/jira/browse/HBASE-13330
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment
>            Reporter: Devaraj Das
>             Fix For: 1.1.0
>         Attachments: 13330-branch-1.txt
> Here is what I found during analysis of an issue. Raising this jira and a fix will follow.
> The TL;DR of this is that the AssignmentManager thinks the ServerShutdownHandler would assign the region and the ServerShutdownHandler thinks that the AssignmentManager would assign the region. The region (0d6cf37c18c54c6f4744750c6a7be837) ultimately never gets assigned. Below is an analysis from the logs that captures the flow of events.
> 1. The AssignmentManager had initially assigned this region to dnj1-bcpc-r3n8.example.com,60020,1425598187703.
> 2. When the master restarted, it scanned the meta to learn about the regions in the cluster. The meta record showed this region as assigned to dnj1-bcpc-r3n8.example.com,60020,1425598187703.
> 3. However, this server (dnj1-bcpc-r3n8.example.com,60020,1425598187703) was no longer alive. So, the AssignmentManager queued up a ServerShutdownHandling task for it (which executes asynchronously):
> {noformat}
> 2015-03-06 14:09:31,355 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=dnj1-bcpc-r3n8.example.com,60020,1425598187703 to dead servers,
>  submitted shutdown handler to be executed meta=false
> {noformat}
> 4. The AssignmentManager proceeded to read the RIT nodes from ZK. It found this region as well:
> {noformat}
> 2015-03-06 14:09:31,527 INFO org.apache.hadoop.hbase.master.AssignmentManager: Processing
> {noformat}
> 5. The region was moved to CLOSED state:
> {noformat}
> 2015-03-06 14:09:31,527 WARN org.apache.hadoop.hbase.master.RegionStates: 0d6cf37c18c54c6f4744750c6a7be837 moved to CLOSED on
> dnj1-bcpc-r3n2.example.com,60020,1425603618259, expected dnj1-bcpc-r3n8.example.com,60020,1425598187703
> {noformat}
> Note the reference to dnj1-bcpc-r3n2.example.com,60020,1425603618259. This means that the region had been assigned to dnj1-bcpc-r3n2.example.com,60020,1425603618259, but that regionserver couldn't open the region for some reason, and it changed the state to RS_ZK_REGION_FAILED_OPEN in the RIT znode on ZK.
> 6. After that, the AssignmentManager tried to assign it again. However, the assignment didn't happen because the ServerShutdownHandling task queued earlier had not yet executed:
> {noformat}
> 2015-03-06 14:09:31,527 INFO org.apache.hadoop.hbase.master.AssignmentManager: Skip assigning
>  it's host dnj1-bcpc-r3n8.example.com,60020,1425598187703 is dead but not processed yet
> {noformat}
> 7. Eventually the ServerShutdownHandling task executed.
> {noformat}
> 2015-03-06 14:09:35,188 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for dnj1-bcpc-r3n8.example.com,60020,1425598187703 before assignment.
> 2015-03-06 14:09:35,209 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 19 region(s) that dnj1-bcpc-r3n8.example.com,60020,1425598187703 was
>  carrying (and 0 regions(s) that were opening on this server)
> 2015-03-06 14:09:35,211 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished processing of shutdown of dnj1-bcpc-r3n8.example.com,60020,1425598187703
> {noformat}
> 8. However, the ServerShutdownHandling task skipped the region in question. This was because the region was in RIT, and the ServerShutdownHandling task assumed that the AssignmentManager would assign it as part of handling the RIT nodes:
> {noformat}
> 2015-03-06 14:09:35,210 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Skip assigning region in transition on other server{0d6cf37c18c54c6f4744750c6a7be837
> state=CLOSED, ts=1425668971527, server=dnj1-bcpc-r3n2.example.com,60020,1425603618259}
> {noformat}
> 9. At some point in the future, when the server dnj1-bcpc-r3n2.example.com,60020,1425603618259 dies, the ServerShutdownHandling for it gets queued up (from the log hbase-hbase-master-dnj1-bcpc-r3n1.log):
> {noformat}
> 2015-03-09 11:35:10,607 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted,
> processing expiration [dnj1-bcpc-r3n2.example.com,60020,1425603618259]
> {noformat}
> 10. In RegionStates.java:serverOffline, there is a check on the region's state. Since the region is in the CLOSED state, the following warning is logged:
> {noformat}
> 2015-03-09 11:35:15,711 WARN org.apache.hadoop.hbase.master.RegionStates: THIS SHOULD NOT HAPPEN: unexpected {0d6cf37c18c54c6f4744750c6a7be837 state=CLOSED, ts=1425668971527, server=dnj1-bcpc-r3n2.example.com,60020,1425603618259}
> {noformat}
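
The mutual deferral in steps 6 and 8 of the description can be modelled in a few lines of
Python. This is only an illustrative sketch under simplifying assumptions, not the HBase
implementation: RegionView, am_assigns, and ssh_assigns are hypothetical names, and the two
checks are reduced to the conditions quoted in the log messages.

```python
# Toy model of the two skip decisions (steps 6 and 8); not actual HBase code.
# All class, function, and field names are hypothetical.
from dataclasses import dataclass

@dataclass
class RegionView:
    last_host: str       # server taken from meta at master startup
    rit_server: str      # server recorded in the RIT node
    in_transition: bool  # True while an RIT node exists for the region

def am_assigns(region: RegionView, dead_not_processed: set) -> bool:
    # Step 6: the AssignmentManager defers while the last host is dead
    # and its shutdown handler has not run yet.
    return region.last_host not in dead_not_processed

def ssh_assigns(region: RegionView, dead_server: str) -> bool:
    # Step 8: the shutdown handler skips regions that are in transition
    # on a different server, expecting the AssignmentManager to handle them.
    if region.in_transition and region.rit_server != dead_server:
        return False
    return region.last_host == dead_server

R3N8 = "dnj1-bcpc-r3n8.example.com,60020,1425598187703"
R3N2 = "dnj1-bcpc-r3n2.example.com,60020,1425603618259"

region = RegionView(last_host=R3N8, rit_server=R3N2, in_transition=True)
am = am_assigns(region, {R3N8})
ssh = ssh_assigns(region, R3N8)
print(am or ssh)  # False: neither component assigns the region
```

In this model each check is reasonable on its own; the region is lost only because the meta
entry and the RIT node name different servers at the same time.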
