Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Message-ID: <23873037.423621285607738137.JavaMail.jira@thor>
Date: Mon, 27 Sep 2010 13:15:38 -0400 (EDT)
From: "Jean-Daniel Cryans (JIRA)" <jira@apache.org>
To: issues@hbase.apache.org
Subject: [jira] Created: (HBASE-3041) [replication] ReplicationSink
 shouldn't kill the whole RS when it fails to replicate
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

[replication] ReplicationSink shouldn't kill the whole RS when it fails to replicate
------------------------------------------------------------------------------------

                 Key: HBASE-3041
                 URL: https://issues.apache.org/jira/browse/HBASE-3041
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.89.20100924
            Reporter: Jean-Daniel Cryans
            Assignee: Jean-Daniel Cryans
             Fix For: 0.90.0


This is kind of a funny bug, as long as you don't run into it. I thought
it'd be a good idea to kill the region servers that act as sinks when they
can't replicate edits on their own cluster (this is something we often do in
the face of fatal errors throughout the code), but it turns out not so much.

So, last Friday I was using CopyTable to copy data from a master to a slave
cluster while the new data was being replicated. One table got really slow
and took too long to split, which tripped a RetriesExhaustedException coming
out of HTable in ReplicationSink. This killed a first region server, which
was itself hosting regions. Splitting the logs took a bit longer since the
cluster was under high insert load, and this triggered other exceptions in
the other region servers, to the point where they were all down. I restarted
the cluster; the master split all the remaining logs and began assigning
regions. Some of them took too long to open because each region server had a
few regions to recover and the last ones in the queue were minutes from
being opened. Since the master cluster was already pushing edits to the
slave, the region servers all got RetriesExhaustedException and all went
down again. I changed the client pause from 1 to 3 and restarted; the same
thing happened. I changed it to 5 and was finally able to keep the cluster
up. Fortunately, the master cluster was queueing up the HLogs, so we didn't
lose any data and the backlog was replicated in a few minutes.

So, instead of killing the region server, any exception coming out of HTable
should just be treated as a failure to apply the edits, and the source
cluster should retry later.
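
A minimal sketch of that proposal, not the actual HBase code: every name
below (TableWriter, SketchSink, SketchSource) is a hypothetical stand-in, and
edits are modelled as plain strings. The sink catches the IOException from
the table client and reports a failure to apply, and the source keeps the
batch queued and ships it again on a later attempt.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the table client the sink writes through (HTable in the report). */
interface TableWriter {
    void write(List<String> edits) throws IOException;
}

/** Hypothetical sink: reports a failure to its caller instead of aborting its host server. */
final class SketchSink {
    private final TableWriter writer;

    SketchSink(TableWriter writer) {
        this.writer = writer;
    }

    /** Returns true if the batch was applied, false if the source should retry later. */
    boolean apply(List<String> edits) {
        try {
            writer.write(edits);
            return true;
        } catch (IOException e) {
            // Old behaviour in the report: abort the region server here.
            // Proposed behaviour: treat this as "failed to apply" and stay up.
            return false;
        }
    }
}

/** Hypothetical source: keeps every batch queued until the sink accepts it. */
final class SketchSource {
    private final List<List<String>> backlog = new ArrayList<>();

    void enqueue(List<String> batch) {
        backlog.add(batch);
    }

    int backlogSize() {
        return backlog.size();
    }

    /** Ships queued batches in order, stopping at the first failure; returns how many were applied. */
    int shipOnce(SketchSink sink) {
        int shipped = 0;
        while (!backlog.isEmpty() && sink.apply(backlog.get(0))) {
            backlog.remove(0);
            shipped++;
        }
        return shipped;
    }
}

final class ReplicationSketch {
    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated slave cluster that rejects the first two write attempts.
        TableWriter flaky = edits -> {
            if (calls[0]++ < 2) {
                throw new IOException("retries exhausted (simulated)");
            }
        };
        SketchSink sink = new SketchSink(flaky);
        SketchSource source = new SketchSource();
        source.enqueue(Arrays.asList("edit-1", "edit-2"));
        source.enqueue(Arrays.asList("edit-3"));

        for (int attempt = 1; source.backlogSize() > 0; attempt++) {
            int shipped = source.shipOnce(sink);
            System.out.println("attempt " + attempt + ": shipped " + shipped
                + ", backlog " + source.backlogSize());
        }
    }
}
```

The sketch retries immediately for brevity; a real source would presumably
wait between attempts, and how long to wait is left open here.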

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.