hbase-issues mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4033) The shutdown RegionServer could be added to AssignmentManager.servers again
Date Mon, 27 Jun 2011 19:09:47 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055701#comment-13055701
] 

Jean-Daniel Cryans commented on HBASE-4033:
-------------------------------------------

Nice find Jieshan.

So currently it seems that the only places where we call addToServers are regionOnline and
rebuildUserRegions, meaning that the code relies on being able to add the server inside
regionOnline once the cluster is started. The fact that regionOnline is called asynchronously,
some time after the region actually came online, makes it harder to handle an RS that is gone.

It seems we should only add the server when it's actually started and remove it when it's
dead.
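
To make that concrete, here is a rough sketch of the idea, not the actual AssignmentManager
code. The class LiveServerTracker and its method names are made up for illustration; only
the notion of a servers map keyed by "host,port,startcode" comes from this issue. The point
is that regionOnline never creates a server entry, so a late event cannot bring a dead
server back into the map.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: server entries are created only on startup and
// removed only on death. This is NOT the real AssignmentManager.
final class LiveServerTracker {
    // serverName ("host,port,startcode") -> names of regions hosted there
    private final Map<String, Set<String>> servers = new ConcurrentHashMap<>();

    /** Called once when a region server reports for duty. */
    void serverStarted(String serverName) {
        servers.putIfAbsent(serverName, ConcurrentHashMap.<String>newKeySet());
    }

    /** Called when the server is declared dead; returns the regions it held. */
    Set<String> serverDead(String serverName) {
        Set<String> regions = servers.remove(serverName);
        return regions == null ? Collections.<String>emptySet() : regions;
    }

    /**
     * Records a region as online on a server. Returns false, and records
     * nothing, if the server is not currently tracked as alive.
     */
    boolean regionOnline(String regionName, String serverName) {
        // computeIfPresent is atomic with respect to a concurrent remove().
        return servers.computeIfPresent(serverName, (name, regions) -> {
            regions.add(regionName);
            return regions;
        }) != null;
    }

    boolean isTracked(String serverName) {
        return servers.containsKey(serverName);
    }
}
```

With this shape, a regionOnline event that arrives after serverDead is simply rejected, and
the caller can decide to reassign the region instead.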

Also we should consider clearing the queues that have references to something that's now stale...
but that might be a lot harder to do.
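
As a very rough illustration of what clearing stale references could look like, assume the
pending work sits in a single queue of (region, destination) plans. That is a big
simplification; PendingAssignments and Plan below are hypothetical names, and the real
master has several queues and executors, which is exactly why this is harder in practice.

```java
import java.util.Iterator;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of purging queued work that targets a dead server.
final class PendingAssignments {
    /** One queued assignment: which region goes to which server. */
    static final class Plan {
        final String region;
        final String destination; // "host,port,startcode"

        Plan(String region, String destination) {
            this.region = region;
            this.destination = destination;
        }
    }

    private final Queue<Plan> queue = new ConcurrentLinkedQueue<>();

    void enqueue(String region, String destination) {
        queue.add(new Plan(region, destination));
    }

    /** Drops every queued plan that targets the dead server; returns how many were dropped. */
    int purge(String deadServer) {
        int removed = 0;
        for (Iterator<Plan> it = queue.iterator(); it.hasNext();) {
            if (it.next().destination.equals(deadServer)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    int size() {
        return queue.size();
    }
}
```

The regions from the dropped plans would still need to be reassigned elsewhere; the sketch
only covers removing the stale entries.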

> The shutdown RegionServer could be added to AssignmentManager.servers again
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-4033
>                 URL: https://issues.apache.org/jira/browse/HBASE-4033
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: A_hbase-root-master-167-6-1-11.rar, analysis.gif
>
>
> The following steps can easily recreate the problem:
> 1. There are thousands of regions in the cluster.
> 2. Stop the cluster.
> 3. Start the cluster. Kill one regionserver while the regions are opening, then restart
> it after 10 seconds.
> The regionserver that was shut down will appear in the AssignmentManager.servers list again.
> For example:
> Issue 1:
> 2011-06-23 14:14:30,775 DEBUG org.apache.hadoop.hbase.master.LoadBalancer: Server information: 167-6-1-12,20020,1308803390123=2220, 167-6-1-13,20020,1308803391742=2374, 167-6-1-11,20020,1308803386333=2205, 167-6-1-13,20020,1308803514394=2183
> Two regionservers (one of which had aborted) have the same hostname but different startcodes:
> 167-6-1-13,20020,1308803391742=2374
> 167-6-1-13,20020,1308803514394=2183
> Issue 2:
> (1). The RS 167-6-1-11,20020,1308105402003 finished shutdown at "10:46:37,774":
> 10:46:37,774 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished processing of shutdown of 167-6-1-11,20020,1308105402003
> (2). Overwriting happened; it seems the RS still existed in the set of AssignmentManager#regions:
> 10:45:55,081 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 612342de1fe4733f72299d70addb6d11 on serverName=167-6-1-11,20020,1308105402003, load=(requests=0, regions=0, usedHeap=0, maxHeap=0)
> (3). The region was assigned to this dead RS again at "10:50:20,671":
> 10:50:20,671 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region Jeason10,08058613800000030,1308032774777.612342de1fe4733f72299d70addb6d11. to 167-6-1-11,20020,1308105402003
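
For what it's worth, the symptom in Issue 1 (same host and port, different startcode) is
easy to check for mechanically. The sketch below is only an illustration; the class
DuplicateServerFinder is hypothetical and it assumes server names are plain
"host,port,startcode" strings as they appear in the LoadBalancer log line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: finds server names that share host and port but
// differ in startcode, which suggests a stale entry for a restarted server.
final class DuplicateServerFinder {
    /** Strips the trailing ",startcode" from "host,port,startcode". */
    static String hostAndPort(String serverName) {
        int lastComma = serverName.lastIndexOf(',');
        if (lastComma < 0) {
            throw new IllegalArgumentException("not host,port,startcode: " + serverName);
        }
        return serverName.substring(0, lastComma);
    }

    /** Returns "host,port" -> all server names seen for it, keeping only duplicated addresses. */
    static Map<String, List<String>> findDuplicates(List<String> serverNames) {
        Map<String, List<String>> byAddress = new LinkedHashMap<>();
        for (String name : serverNames) {
            byAddress.computeIfAbsent(hostAndPort(name), k -> new ArrayList<>()).add(name);
        }
        // Drop addresses that only have a single incarnation.
        byAddress.values().removeIf(names -> names.size() < 2);
        return byAddress;
    }
}
```

Run against the four server names in the log line above, this would flag only
167-6-1-13,20020, with its two startcodes.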

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
