hbase-issues mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
Date Wed, 22 Jun 2011 21:58:50 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053493#comment-13053493 ]

Jean-Daniel Cryans commented on HBASE-3984:
-------------------------------------------

Now that I'm running the unit tests, I can see that the Master doesn't handle the case where
it's assigning a region to a region server that is shutting down. This will create some ripples;
I'll try to keep them contained on trunk.

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server
not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region
servers weren't able to talk to the new .META. location because the old one was still alive
but on its way down after an OOME.
> This translates into exceptions like "Server not running" when trying to edit .META.
Digging into the code, I see that CT.waitForMetaServerConnectionDefault -> waitForMeta
-> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh.
In this method we check if the RS is good by calling getRegionInfo, which *does not* check
if the region server is trying to close.
> What this means is that a cluster can't recover from the failure of a .META.-serving RS
until that RS has fully shut down, since every time an RS tries to open a region (for example,
right after the log splitting) or split one, it fails to edit .META.
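
The weakness described above can be sketched in a few lines of Java. This is a minimal, self-contained model, not HBase's actual code: the `RegionServer` interface, `FakeRegionServer`, `isStopping()`, `verifyWeak`, and `verifyStrict` are all hypothetical names introduced for illustration, and the stricter check is only one possible approach, not necessarily what the attached patches do.

```java
import java.io.IOException;

/** Hypothetical model of the check described in the issue; not HBase's real classes. */
class VerifyLocationSketch {

    /** Hypothetical stand-in for the calls a client can make on a region server. */
    interface RegionServer {
        String getRegionInfo(String regionName) throws IOException;
        boolean isStopping();
    }

    /** A fake server that still answers getRegionInfo while it is shutting down. */
    static final class FakeRegionServer implements RegionServer {
        private final boolean stopping;

        FakeRegionServer(boolean stopping) {
            this.stopping = stopping;
        }

        @Override
        public String getRegionInfo(String regionName) {
            // Answers regardless of the stopping flag, like the behavior the issue reports.
            return regionName;
        }

        @Override
        public boolean isStopping() {
            return stopping;
        }
    }

    /** Weak check: passes whenever getRegionInfo returns an answer. */
    static boolean verifyWeak(RegionServer rs, String region) {
        try {
            return rs.getRegionInfo(region) != null;
        } catch (IOException e) {
            return false;
        }
    }

    /** Stricter check (an assumption): also reject a server that reports it is stopping. */
    static boolean verifyStrict(RegionServer rs, String region) {
        return !rs.isStopping() && verifyWeak(rs, region);
    }

    public static void main(String[] args) {
        RegionServer dying = new FakeRegionServer(true);
        System.out.println("weak=" + verifyWeak(dying, ".META."));
        System.out.println("strict=" + verifyStrict(dying, ".META."));
    }
}
```

In this model, a server that is on its way down still passes the weak check but fails the stricter one, which is the gap the description points at: callers keep being sent to a .META. location that can no longer accept edits.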

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
