Date: Thu, 17 Apr 2014 08:56:20 +0300
Subject: Re: HBase region server failure issues
From: Claudiu Soroiu
To: dev@hbase.apache.org

Thanks for the hints. I will take a look and explore the idea.

Claudiu

On Tue, Apr 15, 2014 at 1:43 PM, Claudiu Soroiu wrote:

> First of all, thanks for the clarifications.
>
> **how about 300 regions with 3x replication? Or 1000 regions? This
> is going to be 3000 files on HDFS, per one RS.**
>
> Now I see that the trade-off is how to reduce the recovery time without
> affecting the overall performance of the cluster.
> Having too many WALs affects write performance; multiple WALs might
> improve the recovery process, but the number of WALs should stay
> relatively small.
>
> Would it be feasible to know ahead of time where a region would be
> activated in case of a failure, and to keep for each region server a
> second WAL file containing backup edits?
> E.g. if machine B crashes, then one region will be assigned to node A,
> one to node C, etc.
>
> Another view would be: Server A will back up a region from Server B if
> it crashes, a region from Server C, etc. Basically, this second WAL
> would contain the data needed to quickly recover a crashed node.
> This adds additional redundancy and some degree of complexity to the
> solution, but it ensures data locality after a crash and faster recovery.
>
> This sounds like what I called Shadow Memstores. It depends on HDFS file
> affinity groups (favored nodes could help, but aren't guaranteed) and
> could be used for super-fast edit recovery. See this thread and JIRA.
> Here's a link to a doc I posted on the HBASE-10070 JIRA. This requires
> some simplifications on the master side, and should be compatible with
> the current approach in HBASE-10070.
> https://docs.google.com/document/d/1q5kJTOA3sZ760sHkORGZNeWgNuMzP41PnAXtaCgPgEU/edit#heading=h.pyxl4wbui0l
>
> **What did you do Claudiu to get the time down?**
>
> Decreased the HDFS block size to 64 MB for now.
> Enabled the settings that avoid stale HDFS datanodes.
> The cluster I tested this on was relatively small - 10 machines.
> I tuned the ZooKeeper sessions to keep the heartbeat at 5 seconds for
> the moment, and plan to decrease this value.
> At this point dfs.heartbeat.interval is at its default of 3 seconds, but
> I also plan to decrease it and run a more intensive test.
> (Decreasing these intervals is based on experience with our current
> system, which is configured at 1.2 seconds and hasn't had any issues
> even under heavy load; obviously, stop-the-world GC pauses should stay
> shorter than the heartbeat interval.)
> I also remember changing the client's reconnect intervals to let it
> reconnect to the region as fast as possible.
> I am at an early stage of experimenting with HBase, but there are a lot
> of things to test/check...
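For reference, those settings typically live in hdfs-site.xml and
hbase-site.xml. A rough sketch along those lines (stock Hadoop/HBase
property names; the values are only illustrative, not necessarily the
exact ones used in the test described above):

  <!-- hdfs-site.xml -->
  <property>
    <name>dfs.blocksize</name>
    <value>67108864</value>  <!-- 64 MB, down from the 128 MB default -->
  </property>
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>  <!-- seconds; the default, a candidate for lowering -->
  </property>
  <property>
    <name>dfs.namenode.avoid.read.stale.datanode</name>
    <value>true</value>  <!-- deprioritize stale datanodes for reads -->
  </property>
  <property>
    <name>dfs.namenode.avoid.write.stale.datanode</name>
    <value>true</value>  <!-- avoid placing new replicas on stale datanodes -->
  </property>

  <!-- hbase-site.xml -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>5000</value>  <!-- ms; illustrative - keep above worst-case GC pause -->
  </property>
  <property>
    <name>hbase.client.pause</name>
    <value>100</value>  <!-- ms; illustrative - one of the client retry/reconnect knobs -->
  </property>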
> On Tue, Apr 15, 2014 at 11:03 PM, Vladimir Rodionov wrote:
>
> > *We also had a global HDFS file limit to contend with*
> >
> > Yes, we have been seeing this from time to time in our production
> > clusters. Periodic purging of old files helps, but the issue is obvious.
> >
> > -Vladimir Rodionov
> >
> > On Tue, Apr 15, 2014 at 11:58 AM, Stack wrote:
> >
> > > On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu wrote:
> > >
> > > > ....
> > > > After some tuning I managed to reduce it to 8 seconds in total and
> > > > for the moment it fits the needs.
> > >
> > > What did you do Claudiu to get the time down?
> > > Thanks,
> > > St.Ack

--
// Jonathan Hsieh (shay)
// HBase Tech Lead, Software Engineer, Cloudera
// jon@cloudera.com
// @jmhsieh