hbase-issues mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (HBASE-3041) [replication] ReplicationSink shouldn't kill the whole RS when it fails to replicate
Date Fri, 15 Oct 2010 00:21:33 GMT

     [ https://issues.apache.org/jira/browse/HBASE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-3041.
---------------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

Committed to trunk.

> [replication] ReplicationSink shouldn't kill the whole RS when it fails to replicate
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-3041
>                 URL: https://issues.apache.org/jira/browse/HBASE-3041
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.89.20100924
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>             Fix For: 0.90.0
>
>         Attachments: HBASE-3041.patch
>
>
> This is kind of a funny bug, as long as you don't run into it. I thought it'd be a good
> idea to kill the region servers that act as sinks when they can't replicate edits on their
> own cluster (this is often what we do in the face of fatal errors throughout the code), but
> it turned out not to be.
> So, last Friday, while I was using CopyTable to copy data from a master cluster to a slave
> cluster at the same time as new data was being replicated, one table got really slow and took
> too long to split, which tripped a RetriesExhaustedException coming out of HTable in
> ReplicationSink. This killed a first region server, which was itself hosting regions. Splitting
> the logs took a bit longer since the cluster was under high insert load, and this triggered
> other exceptions in the other region servers, to the point where they were all down. I restarted
> the cluster; the master split all the remaining logs and began assigning regions. Some regions
> took too long to open because each region server had a few regions to recover, and the last
> ones in the queue were minutes away from being opened. Since the master cluster was already
> pushing edits to the slave, the region servers all got RetriesExhaustedException again and all
> went down. I changed the client pause from 1 to 3 and restarted; the same thing happened. I
> changed it to 5 and was finally able to keep the cluster up. Fortunately, the master cluster
> was queueing up the HLogs, so we didn't lose any data and the backlog was replicated in a few
> minutes.
> So, instead of killing the region server, any exception coming out of HTable should just
be treated as a failure to apply and the source cluster should retry later.
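
The fix described above boils down to not aborting the region server when applying replicated
edits fails. Below is a minimal, illustrative sketch of that behaviour in Java; the class,
interface, and method names are placeholders chosen for this message and are not the actual
HBase 0.90 ReplicationSink API. The point is only that an exception from the client table
propagates back as a failed batch (so the source cluster keeps its HLogs queued and retries
later) instead of killing the whole RS.

import java.io.IOException;
import java.util.List;

// Illustrative sketch only -- the names below are placeholders, not the real
// org.apache.hadoop.hbase.replication.regionserver.ReplicationSink signatures.
class ReplicationSinkSketch {

  /** Stand-in for a replicated WAL edit. */
  interface Edit { }

  /** Stand-in for the HTable client used to apply edits on this cluster. */
  interface Table {
    void put(Edit edit) throws IOException;
  }

  private final Table table;

  ReplicationSinkSketch(Table table) {
    this.table = table;
  }

  /**
   * Apply a batch of replicated edits. Any IOException (for example a
   * RetriesExhaustedException while a region is splitting or recovering) is
   * treated as "failed to apply" and rethrown to the caller, rather than
   * aborting the region server as the old code did. The source cluster keeps
   * the edits queued and retries the batch later.
   */
  void replicateEntries(List<Edit> entries) throws IOException {
    for (Edit edit : entries) {
      table.put(edit); // may throw; the caller reports failure to the source
    }
  }
}

The trade-off is the one the description argues for: a transient failure on the slave (slow
split, region recovery) becomes a retry on the source side rather than a cascade of region
server aborts.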

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

