hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jerry He (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HBASE-16581) Optimize Replication queue transfers after server fail over
Date Thu, 08 Sep 2016 17:05:20 GMT

     [ https://issues.apache.org/jira/browse/HBASE-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Jerry He resolved HBASE-16581.
------------------------------
    Resolution: Duplicate

> Optimize Replication queue transfers after server fail over
> -----------------------------------------------------------
>
>                 Key: HBASE-16581
>                 URL: https://issues.apache.org/jira/browse/HBASE-16581
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Jerry He
>            Assignee: Jerry He
>
> Currently if a region server fails, the replication queue of this server will be picked
up by another region server.  The problem is this queue can possibly be huge and contains
queues from other multiple or cascading server failures.
> We had such a case in production.  From zk_dump:
> {code}
> ...
> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467603735059:
18748267
> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471723778060:

> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1468258960080:

> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1468204958990:

> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1469701010649:

> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1470409989238:

> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471838985073:

> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467142915090:
57804890
> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1472181000614:

> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1471464567365:

> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1469486466965:

> /hbase/replication/rs/r01data10-va-pub.xxx.ext,60020,1472680007498/1-r01data10-va-pub.xxx.ext,60020,1455993417125-r01data07-va-pub.xxx.ext,60020,1472680008225-r01data08-va-pub.xxx.ext,60020,1472680007318/r01data10-va-pub.xxx.ext%2C60020%2C1455993417125.1467787339841:
47812951
> ...
> {code}
> There were hundreds of wals hanging under this queue, coming from diferent region servers,
which took a long time to replicate.
> We should have a better strategy which lets live region servers each grep part of this
nested queue, and replicate in parallel.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message