hbase-issues mailing list archives

From "ryan rawson (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2940) Improve behavior under partial failure of region servers
Date Sun, 29 Aug 2010 20:43:53 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904033#action_12904033 ]

ryan rawson commented on HBASE-2940:
------------------------------------

I think the primary mechanism of shutdown/termination should be via the hlog
block. The master should close the logfile, then reassign regions. Since the
hlog is gone, any operations that were successful would terminate, and the
reassignment would prevent new clients from talking to the dead server.
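
As a rough illustration of the ordering described above (close the log first, then reassign), here is a toy in-memory model in Python. It is only a sketch: `WriteAheadLog`, `RegionServer`, `Master` and `fence_and_reassign` are hypothetical names made up for this example and are not HBase classes or APIs. It assumes every edit must be appended to the log before it is applied, so closing the log is enough to stop the old server from accepting writes.

```python
class LogClosedError(Exception):
    """Raised when a server tries to append to a log the master has closed."""


class WriteAheadLog:
    """Hypothetical stand-in for a region server's hlog."""

    def __init__(self):
        self.entries = []
        self.closed = False

    def append(self, entry):
        if self.closed:
            raise LogClosedError("log was closed by the master")
        self.entries.append(entry)

    def close(self):
        self.closed = True


class RegionServer:
    """Toy server: an edit is applied only after the log append succeeds."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.memstore = {}

    def put(self, region, key, value):
        self.log.append((region, key, value))  # fails once the log is closed
        self.memstore[(region, key)] = value


class Master:
    """Toy master that tracks which server hosts each region."""

    def __init__(self):
        self.assignments = {}  # region name -> RegionServer

    def assign(self, region, server):
        self.assignments[region] = server

    def fence_and_reassign(self, bad_server, replacement):
        # Step 1: close the log so the bad server can no longer accept edits.
        bad_server.log.close()
        # Step 2: move its regions so new clients are routed elsewhere.
        for region, server in list(self.assignments.items()):
            if server is bad_server:
                self.assignments[region] = replacement
```

In this model, a `put` on the fenced server raises `LogClosedError` before the memstore is touched, and lookups in `Master.assignments` return the replacement server.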


> Improve behavior under partial failure of region servers
> --------------------------------------------------------
>
>                 Key: HBASE-2940
>                 URL: https://issues.apache.org/jira/browse/HBASE-2940
>             Project: HBase
>          Issue Type: New Feature
>          Components: master, regionserver
>            Reporter: Todd Lipcon
>
> On larger clusters, we often see failure cases where a server is "up" (ie heartbeating)
> but unable to actually service requests properly (or at a reasonable speed). This can happen
> for any number of reasons including:
> - failing disks or disk controllers respond, but do so very slowly
> - the machine is swapping, so everything is still running but much more slowly than expected
> - HBase or the DN on the machine has been misconfigured (eg missing lzo libs) so it fails
> to correctly open regions, perform flushes, etc.
> Here are a few proposed features that are worth considering:
> 1) Add a "blacklist" or "remote shutdown" functionality to the master. This is useful
> if the region server is up but for some reason the admin can't ssh in to shut it down (eg
> the root disk has failed). This feature would allow the admin to issue a command that will
> shut down any given RS.
> 2) Periodically run a "health check" script on the region server node. If the script
> returns an error code, the RS could shut itself down gracefully and report an error message
> on the master console.
> 3) Allow clients to report back RS-specific errors to the master. This would be useful
> for monitoring, and we could add heuristics to automatically shut down region servers if they
> have an elevated error count over some period of time.
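
One possible shape for the heuristic in proposal 3 is a sliding-window counter of client-reported errors per region server. The sketch below is an assumption-laden illustration, not part of the issue: the class name `ErrorTracker`, the 60-second window, and the threshold of 5 are all invented for the example, and it assumes reports for a given server arrive in non-decreasing time order.

```python
from collections import defaultdict, deque


class ErrorTracker:
    """Counts client-reported errors per region server in a sliding window.

    Hypothetical helper; window and threshold defaults are arbitrary.
    """

    def __init__(self, window_seconds=60.0, threshold=5):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._reports = defaultdict(deque)  # server name -> report timestamps

    def report_error(self, server, now):
        """Record one client-reported error for `server` at time `now` (seconds)."""
        self._reports[server].append(now)

    def error_count(self, server, now):
        """Return how many errors were reported for `server` within the window."""
        reports = self._reports[server]
        cutoff = now - self.window_seconds
        # Drop reports that have aged out of the window.
        while reports and reports[0] <= cutoff:
            reports.popleft()
        return len(reports)

    def should_shut_down(self, server, now):
        """True when the recent error count reaches the threshold."""
        return self.error_count(server, now) >= self.threshold
```

A master-side monitor could call `should_shut_down` on each report and, when it returns true, trigger whatever remote-shutdown mechanism proposal 1 ends up providing.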

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
