hbase-issues mailing list archives

From "Himanshu Vashishtha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-6165) Replication can overrun .META scans on cluster re-start
Date Thu, 09 Aug 2012 18:03:19 GMT

    [ https://issues.apache.org/jira/browse/HBASE-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432029#comment-13432029 ]

Himanshu Vashishtha commented on HBASE-6165:

[~eclark]: I used "custom" because, in my opinion, the current naming scheme is not appropriate
(I started with medium/semi QOS, but then changed it to "custom"). Using "priority" is something
of a misnomer, as there is no priority as such; it is just a different set of handlers serving
the requests.
Though we call them priorityHandlers, etc., they are just like regular handlers, only dedicated
to meta operations. I think we should rename them to metaOpsHandlers (or metaHandlers). Yes,
I just used a threshold between 0 and 10.
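To make the routing idea above concrete, here is a minimal, hypothetical sketch (the class, method, and pool names are illustrative, not HBase's actual API): requests are classified by an integer priority, and those at or above the high threshold go to the separate pool of handlers rather than the regular ones.

```java
// Hypothetical sketch of routing requests to handler pools by priority.
// HIGH_QOS / NORMAL_QOS thresholds and pool names are assumptions for
// illustration; they are not the real HBase constants.
public class HandlerRouter {
    static final int HIGH_QOS = 10;   // threshold discussed above
    static final int NORMAL_QOS = 0;

    /** Pick a handler pool name from a request's priority. */
    static String poolFor(int priority) {
        if (priority >= HIGH_QOS) {
            return "metaHandlers";      // dedicated handlers for meta operations
        } else if (priority > NORMAL_QOS) {
            return "customHandlers";    // the 0 < priority < 10 band discussed here
        }
        return "regularHandlers";       // everything else
    }

    public static void main(String[] args) {
        System.out.println(poolFor(10));  // metaHandlers
        System.out.println(poolFor(5));   // customHandlers
        System.out.println(poolFor(0));   // regularHandlers
    }
}
```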

bq. Since this starts 0 "custom" priority handlers by default it will add another undocumented
step when enabling replication. We should either make the number of handlers start by default
> 0, or have the number depend on if replication is enabled.
I am OK with a default > 0; I don't think it should be tied to replication, since these handlers
can be used for other methods too (such as security, etc.).

bq. The naming is weird. These are not "Custom"QOS, but "Medium"QOS methods, right?
Hope you find it rational now.

bq. By default now (if hbase.regionserver.custom.priority.handler.count is not set), replicateWALEntry
would use non-priority handlers... Which is not right, I think. It should revert back to the
current behavior in that case (which is to use the priorityQOS).
default > 0 sounds good?
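The "default > 0" fallback could look like the following sketch. The property name is from the discussion above; the helper class and the default value of 3 are assumptions for illustration, not the actual HBase code.

```java
import java.util.Properties;

// Illustrative only: with a default > 0, custom handlers still start when
// hbase.regionserver.custom.priority.handler.count is unset, so
// replicateWALEntry does not silently fall back to the regular handlers.
public class HandlerCount {
    static final String KEY = "hbase.regionserver.custom.priority.handler.count";
    static final int DEFAULT_COUNT = 3;  // assumed default; any value > 0 works

    static int customHandlerCount(Properties conf) {
        String v = conf.getProperty(KEY);
        return (v == null) ? DEFAULT_COUNT : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();            // nothing set
        System.out.println(customHandlerCount(conf));  // falls back to the >0 default
        conf.setProperty(KEY, "0");
        System.out.println(customHandlerCount(conf));  // explicit opt-out
    }
}
```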

bq. What I still do not understand... Does this problem always happen? Does it happen because
replicateWALEntry takes too long to finish? Does this only happen when the slave is already
degraded for other reasons? Should we also work on replicateWALEntry failing faster in case
of problems (shorter/fewer retries, etc)?

It can occur when the slave cluster is slow, and whenever it happens it makes the entire
cluster unresponsive. I have a patch that adds fail-fast behavior in the sink and have been
testing it too; it looks good so far. I tried creating a new JIRA but got an IOE while creating
it (see INFRA-5131). I will attach the patch once it is created.
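The fail-fast idea in the sink could be sketched roughly as below. This is a hypothetical illustration of the general technique (capped retries with a short, fixed backoff so a slow slave surfaces an error to the source quickly instead of pinning handlers); the class, interface, and numbers are mine, not the attached patch.

```java
// Hypothetical fail-fast sketch for a replication sink: few retries, short
// fixed backoff, then rethrow so the source cluster sees the failure promptly.
public class FailFastSink {
    static final int MAX_RETRIES = 2;    // fewer retries than a usual client default
    static final long BACKOFF_MS = 100;  // short, fixed backoff (no exponential growth)

    interface Op { void run() throws Exception; }

    static void applyWithFailFast(Op op) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            try {
                op.run();
                return;                  // success, stop retrying
            } catch (Exception e) {
                last = e;
                Thread.sleep(BACKOFF_MS);
            }
        }
        throw last;                      // surface the failure instead of hanging
    }

    public static void main(String[] args) throws Exception {
        final int[] attempts = {0};
        applyWithFailFast(() -> {
            attempts[0]++;
            if (attempts[0] < 2) throw new RuntimeException("transient failure");
        });
        System.out.println("succeeded after " + attempts[0] + " attempts");
    }
}
```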
> Replication can overrun .META scans on cluster re-start
> -------------------------------------------------------
>                 Key: HBASE-6165
>                 URL: https://issues.apache.org/jira/browse/HBASE-6165
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Elliott Clark
>         Attachments: HBase-6165-v1.patch
> When restarting a large set of regions on a reasonably small cluster, the replication
traffic from another cluster tied up every xceiver, meaning nothing could be onlined.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

