From: Jonathan Hsieh <jon@cloudera.com>
Date: Wed, 16 Apr 2014 07:49:50 -0700
Subject: Re: HBase region server failure issues
To: dev@hbase.apache.org

On Tue, Apr 15, 2014 at 1:43 PM, Claudiu Soroiu wrote:

> First of all, thanks for the clarifications.
>
> **how about 300 regions with 3x replication? Or 1000 regions? This
> is going to be 3000 files. On HDFS. Per one RS.**
>
> Now I see that the trade-off is how to reduce the recovery time without
> affecting the overall performance of the cluster.
> Having too many WALs affects the write performance.
> Basically multiple WALs might improve the process, but the number of WALs
> should stay relatively small.
>
> Would it be feasible to know ahead of time where a region might activate
> in case of a failure, and to have for each region server a second WAL
> file containing backup edits?
> E.g. if machine B crashes, then one region will be assigned to node A,
> one to node C, etc.
> Another view would be: Server A backs up a region from Server B in case
> it crashes, a region from Server C, etc. Basically this second WAL would
> contain the data needed to recover a crashed node quickly.
> This adds additional redundancy and some complexity to the solution, but
> it ensures data locality in case of a crash and faster recovery.

This sounds like what I called Shadow Memstores. This depends on HDFS file
affinity groups (favored nodes could help but aren't guaranteed), and could
be used for super fast edit recovery. See this thread and jira. Here's a
link to a doc I posted on the HBASE-10070 jira. This requires some
simplifications on the master side, and should be compatible with the
current approach in HBASE-10070.

https://docs.google.com/document/d/1q5kJTOA3sZ760sHkORGZNeWgNuMzP41PnAXtaCgPgEU/edit#heading=h.pyxl4wbui0l
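
To make the backup-WAL idea above a bit more concrete, here is a rough
sketch of the bookkeeping it implies. Everything in it (BackupWalPlan,
SecondaryLog, mirrorEdit) is a hypothetical name used for illustration,
not an existing HBase class or API:

  import java.util.HashMap;
  import java.util.Map;

  /**
   * Hypothetical sketch of "a second WAL with backup edits": each region on
   * this server has a pre-chosen failover target, and edits for it are
   * mirrored into a secondary log kept close to that target.
   */
  public class BackupWalPlan {

      // Region name -> server that should take the region over on failure.
      private final Map<String, String> failoverTarget =
          new HashMap<String, String>();

      // One secondary log per failover target server.
      private final Map<String, SecondaryLog> logsByTarget =
          new HashMap<String, SecondaryLog>();

      /** Record ahead of time where a region should land if this server dies. */
      public void planFailover(String regionName, String targetServer) {
          failoverTarget.put(regionName, targetServer);
          if (!logsByTarget.containsKey(targetServer)) {
              logsByTarget.put(targetServer, new SecondaryLog(targetServer));
          }
      }

      /** Mirror an edit into the log kept for the region's failover target. */
      public void mirrorEdit(String regionName, byte[] edit) {
          String target = failoverTarget.get(regionName);
          if (target != null) {
              logsByTarget.get(target).append(regionName, edit);
          }
      }

      /** Stand-in for a log whose blocks would be placed near the target. */
      static class SecondaryLog {
          private final String targetServer;

          SecondaryLog(String targetServer) {
              this.targetServer = targetServer;
          }

          void append(String regionName, byte[] edit) {
              // A real design would write to an HDFS file whose replicas are
              // pinned near targetServer, so replay after a crash stays local.
          }
      }
  }

The extra redundancy is the cost; the win is that the failover target
already has the edits locally, so recovery becomes mostly a local replay
instead of a distributed log split.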

> **What did you do Claudiu to get the time down?**
>
> Decreased the HDFS block size to 64 MB for now.
> Enabled the settings to avoid stale HDFS nodes.
> The cluster I tested this on was relatively small - 10 computers.
> I tuned the ZooKeeper sessions to keep the heartbeat at 5 seconds for
> the moment, and plan to decrease this value.
> At this point dfs.heartbeat.interval is set at the default 3 seconds, but
> I also plan to decrease this and perform a more intensive test.
> (Decreasing the times is based on experience with our current system,
> which is configured at 1.2 seconds and didn't have any issues even under
> heavy loads; obviously stop-the-world GC times should be smaller than the
> heartbeat interval.)
> I also remember I made some changes to the reconnect intervals of the
> client to allow it to reconnect to the region as fast as possible.
> I am at an early stage of experimenting with HBase, but there are a lot
> of things to test/check...
>
>
> On Tue, Apr 15, 2014 at 11:03 PM, Vladimir Rodionov wrote:
>
> > *We also had a global HDFS file limit to contend with*
> >
> > Yes, we have been seeing this from time to time in our production
> > clusters.
> > Periodic purging of old files helps, but the issue is obvious.
> >
> > -Vladimir Rodionov
> >
> >
> > On Tue, Apr 15, 2014 at 11:58 AM, Stack wrote:
> >
> > > On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu wrote:
> > >
> > > > ....
> > > > After some tuning I managed to
> > > > reduce it to 8 seconds in total and for the moment it fits the needs.
> > >
> > > What did you do Claudiu to get the time down?
> > > Thanks,
> > > St.Ack

--
// Jonathan Hsieh (shay)
// HBase Tech Lead, Software Engineer, Cloudera
// jon@cloudera.com // @jmhsieh
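
For reference, the tuning described earlier in the thread maps onto
standard HBase/HDFS configuration properties. A minimal sketch, assuming
the usual property names for these knobs and using the values from the
discussion purely as illustrations - the dfs.* settings really belong in
hdfs-site.xml on the cluster side and are only gathered here to list the
knobs in one place:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class MttrTuningSketch {
      public static void main(String[] args) {
          Configuration conf = HBaseConfiguration.create();

          // ZooKeeper session timeout (ms): bounds how long a dead region
          // server can go unnoticed; 5s mirrors the value discussed above.
          conf.setInt("zookeeper.session.timeout", 5000);

          // Smaller HDFS blocks (64 MB) as mentioned above (hdfs-site.xml).
          conf.setLong("dfs.blocksize", 64L * 1024 * 1024);

          // Mark datanodes stale quickly and avoid them for reads and
          // writes (hdfs-site.xml).
          conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
          conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);

          // Datanode heartbeat interval in seconds, default 3
          // (hdfs-site.xml); a candidate for lowering, keeping GC pauses
          // shorter than this interval.
          conf.setLong("dfs.heartbeat.interval", 3);

          // Client-side retry pacing so a client re-finds a reassigned
          // region quickly after a crash.
          conf.setLong("hbase.client.pause", 100);
          conf.setInt("hbase.client.retries.number", 35);
      }
  }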