hbase-issues mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
Date Wed, 29 Jun 2011 22:29:33 GMT

     [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-3984.
---------------------------------------

      Resolution: Fixed
    Release Note: 
In trunk:
All HRegionInterface methods now throw a RegionServerStoppedException if the region server
is stopped, whereas previously only a few methods checked for that state.
SingleServerBulkAssigner no longer kills the Master when it gets IOExceptions; instead it
logs an error and the TimeoutMonitor takes care of picking up the pieces.

In 0.90:
Only a couple of checkOpen calls were added, in order to change as little code as possible
while still fixing the issue.
    Hadoop Flags: [Reviewed]

Committed the 0.90 patch to branch and the other patch to trunk, including the fix that Ted
pointed to. Thanks guys for the reviews.

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90-v2.patch, HBASE-3984-0.90.patch, HBASE-3984-trunk-v2.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server
> not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured out that the region
> servers weren't able to talk to the new .META. location because the old one was still alive
> but on its way down after an OOME.
> This translates into exceptions like "Server not running" when trying to edit .META.
> Digging in the code, I see that CT.waitForMetaServerConnectionDefault -> waitForMeta
> -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh.
> In this method we check whether the RS is good by calling getRegionInfo, which *does not*
> check whether the region server is trying to close.
> What this means is that a cluster can't recover from a .META.-serving RS failure until that
> RS has fully shut down, since every time an RS tries to open a region (like right after the
> log splitting) or split one, it fails editing .META.
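
The failure mode described above can be illustrated with a minimal Java sketch. It is not the HBase code or the committed patch: CheckOpenSketch, SketchRegionServer, SketchStoppedException, and verifyLocation are hypothetical names, and only the idea of a checkOpen-style guard in front of getRegionInfo is taken from the issue text. The sketch assumes the verification treats a location as good whenever the remote call returns without an IOException.

```java
import java.io.IOException;

// Hypothetical illustration only; none of these types are the actual HBase sources.
class CheckOpenSketch {

    /** Stand-in for the "server is stopped" exception mentioned in the release note. */
    static class SketchStoppedException extends IOException {
        SketchStoppedException(String msg) {
            super(msg);
        }
    }

    /** Toy region server with a flag that is set as soon as shutdown begins. */
    static class SketchRegionServer {
        private volatile boolean stopped = false;

        void beginShutdown() {
            stopped = true;
        }

        /** Guard in the spirit of checkOpen: reject calls once the server is stopping. */
        private void checkOpen() throws SketchStoppedException {
            if (stopped) {
                throw new SketchStoppedException("Server not running");
            }
        }

        /** Unguarded variant: keeps answering while the server is on its way down. */
        String getRegionInfoUnguarded(String regionName) {
            return "info:" + regionName;
        }

        /** Guarded variant: fails fast once shutdown has begun. */
        String getRegionInfoGuarded(String regionName) throws IOException {
            checkOpen();
            return "info:" + regionName;
        }
    }

    /** Treats the location as good if the region-info call succeeds. */
    static boolean verifyLocation(SketchRegionServer rs, boolean guarded) {
        try {
            if (guarded) {
                rs.getRegionInfoGuarded(".META.");
            } else {
                rs.getRegionInfoUnguarded(".META.");
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        SketchRegionServer dyingServer = new SketchRegionServer();
        dyingServer.beginShutdown();
        System.out.println("unguarded check passes: " + verifyLocation(dyingServer, false));
        System.out.println("guarded check passes: " + verifyLocation(dyingServer, true));
    }
}
```

In this sketch, the unguarded check keeps reporting a stopping server as a valid location, which is the stale-location behavior the report describes. The guarded check fails as soon as shutdown begins, so a caller could go look for the new location instead.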

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
