From: "Jean-Daniel Cryans (JIRA)"
To: issues@hbase.apache.org
Date: Mon, 27 Sep 2010 13:15:38 -0400 (EDT)
Message-ID: <23873037.423621285607738137.JavaMail.jira@thor>
Subject: [jira] Created: (HBASE-3041) [replication] ReplicationSink shouldn't kill the whole RS when it fails to replicate

[replication] ReplicationSink shouldn't kill the whole RS when it fails to replicate
------------------------------------------------------------------------------------

                 Key: HBASE-3041
                 URL: https://issues.apache.org/jira/browse/HBASE-3041
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.89.20100924
            Reporter: Jean-Daniel Cryans
            Assignee: Jean-Daniel Cryans
             Fix For: 0.90.0

This is kind of a funny bug, as long as you don't run into it. I thought it'd be a good idea to kill the region servers that act as sinks when they can't replicate edits on their own cluster (this is often something we do in the face of fatal errors throughout the code), but it turns out not to be.

So, last Friday, while I was using CopyTable to replicate data from a master to a slave cluster while new data was being replicated, one table got really slow and took too long to split, which tripped a RetriesExhaustedException coming out of HTable in ReplicationSink. This killed a first region server, which was itself hosting regions. Splitting the logs took a bit longer since the cluster was under high insert load, so this triggered other exceptions in the other region servers, to the point where they were all down.

I restarted the cluster; the master split all the remaining logs and began assigning regions. Some of them took too long to open because each region server had a few regions to recover and the last ones in the queue were minutes from being opened. Since the master cluster was already pushing edits to the slave, the region servers all got RetriesExhaustedException and went down again. I changed the client pause from 1 to 3 and restarted; the same thing happened. I changed it to 5 and was finally able to keep the cluster up. Fortunately, the master cluster was queueing up the HLogs, so we didn't lose any data and the backlog was replicated in a few minutes.

So, instead of killing the region server, any exception coming out of HTable should just be treated as a failure to apply, and the source cluster should retry later.
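
To make the proposal concrete, here is a rough sketch of the behaviour described above: the sink reports a failed apply instead of aborting, and the source keeps the batch and tries again later. It is not the actual patch. `ReplicationSinkSketch`, `TableWriter`, `applyEdits` and `shipWithRetries` are hypothetical names invented for illustration, the table client is a stand-in for HTable, and the linear backoff is an arbitrary assumption.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch only; none of these names are HBase APIs. */
public class ReplicationSinkSketch {

    /** Stand-in for the table client (HTable in the report above). */
    public interface TableWriter {
        void put(List<String> edits) throws IOException;
    }

    /** Sink side: report a failed apply to the caller instead of aborting the server. */
    public static boolean applyEdits(TableWriter table, List<String> edits) {
        try {
            table.put(edits);
            return true;
        } catch (IOException e) {
            // This is the point where the region server used to be killed.
            System.err.println("Unable to apply edits, source should retry: " + e.getMessage());
            return false;
        }
    }

    /**
     * Source side: retry the same batch until it is applied or attempts run out.
     * Returns the attempt number that succeeded, or -1 if the batch is still pending.
     */
    public static int shipWithRetries(TableWriter table, List<String> edits,
                                      int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (applyEdits(table, edits)) {
                return attempt;
            }
            try {
                Thread.sleep(pauseMillis * attempt); // linear backoff (assumed policy)
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1; // not applied; the batch stays queued on the source
    }

    public static void main(String[] args) {
        // Simulated sink table that fails twice (e.g. regions still opening), then recovers.
        int[] calls = {0};
        TableWriter flaky = edits -> {
            if (++calls[0] <= 2) {
                throw new IOException("retries exhausted");
            }
        };
        int attempts = shipWithRetries(flaky, Arrays.asList("edit-1", "edit-2"), 5, 10L);
        System.out.println("applied on attempt " + attempts);
    }
}
```

The point of the sketch is that a slow or recovering slave cluster only delays replication: the source still holds the edits (the queued HLogs in the incident above), so nothing is lost while the sink keeps serving its own regions.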