hbase-issues mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] Created: (HBASE-3041) [replication] ReplicationSink shouldn't kill the whole RS when it fails to replicate
Date Mon, 27 Sep 2010 17:15:38 GMT
[replication] ReplicationSink shouldn't kill the whole RS when it fails to replicate
------------------------------------------------------------------------------------

                 Key: HBASE-3041
                 URL: https://issues.apache.org/jira/browse/HBASE-3041
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.89.20100924
            Reporter: Jean-Daniel Cryans
            Assignee: Jean-Daniel Cryans
             Fix For: 0.90.0


This is kind of a funny bug, as long as you don't run into it. I thought it'd be a good idea
to kill the region servers that act as sinks when they can't replicate edits on their own
cluster (something we often do in the face of fatal errors throughout the code), but it turns
out not to be.

So, last Friday I was using CopyTable to copy data from a master to a slave cluster while new
data was still being replicated. One table got really slow and took too long to split, which
tripped a RetriesExhaustedException coming out of HTable in ReplicationSink. This killed a
first region server, which was itself hosting regions. Splitting its logs took a while since
the cluster was under a high insert load, so this triggered further exceptions in the other
region servers, to the point where they were all down. I restarted the cluster; the master
split the remaining logs and began assigning regions. Some regions took too long to open
because every region server had a few regions to recover, and the last ones in the queue were
minutes away from being opened. Since the master cluster was still pushing edits to the slave,
the region servers all hit RetriesExhaustedException and went down again. I bumped the client
pause from 1 to 3 and restarted; the same thing happened. Only after raising it to 5 was I
finally able to keep the cluster up. Fortunately, the master cluster was queueing up the HLogs,
so we didn't lose any data, and the backlog was replicated in a few minutes.
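
For reference, the pause I was tuning is the client retry pause; assuming it's the usual
hbase.client.pause property in hbase-site.xml (milliseconds, so the 1/3/5 above read as
1000/3000/5000 ms), the setting that finally held up would look something like:

    <property>
      <name>hbase.client.pause</name>
      <!-- 1000 (1s) and 3000 (3s) both took the cluster down; 5000 held -->
      <value>5000</value>
    </property>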

So, instead of killing the region server, any exception coming out of HTable should just be
treated as a failure to apply the edits, and the source cluster should retry them later.
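
Roughly what I have in mind, sketched below; HLogEntry and applyEdits are placeholders for the
real HLog.Entry type and the existing per-table batching code, not actual names from the patch:

    import java.io.IOException;

    // Sketch only, not the actual patch.
    public class ReplicationSinkSketch {
      /** Apply a batch of replicated edits on the local (slave) cluster. */
      public void replicateEntries(HLogEntry[] entries) throws IOException {
        try {
          applyEdits(entries);  // batches per table, writes with HTable
        } catch (IOException e) {
          // Old behavior: abort the whole region server on any failure
          // (RetriesExhaustedException is an IOException, so it lands here).
          // Proposed: just report the batch as failed; the source cluster
          // keeps its HLogs queued and retries the shipment later.
          throw new IOException("Failed to apply replicated edits", e);
        }
      }

      private void applyEdits(HLogEntry[] entries) throws IOException {
        // placeholder for the existing HTable put/delete logic
      }

      /** Placeholder for HLog.Entry. */
      public static class HLogEntry {}
    }

As the incident showed, the source cluster already queues up HLogs when a shipment fails, so
retrying from that side is safe and loses no data.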

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

