From: Jonathan Hsieh <jon@cloudera.com>
Date: Wed, 16 Apr 2014 07:49:50 -0700
Subject: Re: HBase region server failure issues
To: dev@hbase.apache.org

On Tue, Apr 15, 2014 at 1:43 PM, Claudiu Soroiu wrote:

> First of all, thanks for the clarifications.
>
> **how about 300 regions with 3x replication? Or 1000 regions? This
> is going to be 3000 files. On HDFS. Per one RS.**
>
> Now I see that the trade-off is how to reduce the recovery time without
> affecting the overall performance of the cluster.
> Having too many WALs affects the write performance.
> Basically multiple WALs might improve the process, but the number of WALs
> should stay relatively small.
>
> Would it be feasible to know ahead of time where a region might activate
> in case of a failure, and to have for each region server a second WAL
> file containing backup edits?
> E.g. if machine B crashes, then one region will be assigned to node A,
> one to node C, etc.
> Another view would be: Server A backs up a region from Server B in case
> it crashes, a region from Server C, etc. Basically this second WAL would
> contain the data needed to recover a crashed node quickly.
> This adds additional redundancy and some complexity to the solution, but
> it ensures data locality in case of a crash and faster recovery.

This sounds like what I called Shadow Memstores. This depends on HDFS file
affinity groups (favored nodes could help but aren't guaranteed), and could
be used for super fast edit recovery. See this thread and jira. Here's a
link to a doc I posted on the HBASE-10070 jira. This requires some
simplifications on the master side, and should be compatible with the
current approach in HBASE-10070.

https://docs.google.com/document/d/1q5kJTOA3sZ760sHkORGZNeWgNuMzP41PnAXtaCgPgEU/edit#heading=h.pyxl4wbui0l
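
To make the backup-WAL idea above a bit more concrete, here is a rough
sketch of the bookkeeping it implies. Everything in it (BackupWalPlan,
SecondaryLog, mirrorEdit) is a hypothetical name used for illustration,
not an existing HBase class or API:

  import java.util.HashMap;
  import java.util.Map;

  /**
   * Hypothetical sketch of "a second WAL with backup edits": each region on
   * this server has a pre-chosen failover target, and edits for it are
   * mirrored into a secondary log kept close to that target.
   */
  public class BackupWalPlan {

      // Region name -> server that should take the region over on failure.
      private final Map<String, String> failoverTarget =
          new HashMap<String, String>();

      // One secondary log per failover target server.
      private final Map<String, SecondaryLog> logsByTarget =
          new HashMap<String, SecondaryLog>();

      /** Record ahead of time where a region should land if this server dies. */
      public void planFailover(String regionName, String targetServer) {
          failoverTarget.put(regionName, targetServer);
          if (!logsByTarget.containsKey(targetServer)) {
              logsByTarget.put(targetServer, new SecondaryLog(targetServer));
          }
      }

      /** Mirror an edit into the log kept for the region's failover target. */
      public void mirrorEdit(String regionName, byte[] edit) {
          String target = failoverTarget.get(regionName);
          if (target != null) {
              logsByTarget.get(target).append(regionName, edit);
          }
      }

      /** Stand-in for a log whose blocks would be placed near the target. */
      static class SecondaryLog {
          private final String targetServer;

          SecondaryLog(String targetServer) {
              this.targetServer = targetServer;
          }

          void append(String regionName, byte[] edit) {
              // A real design would write to an HDFS file whose replicas are
              // pinned near targetServer, so replay after a crash stays local.
          }
      }
  }

The extra redundancy is the cost; the win is that the failover target
already has the edits locally, so recovery becomes mostly a local replay
instead of a distributed log split.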

> **What did you do Claudiu to get the time down?**
>
> Decreased the HDFS block size to 64 MB for now.
> Enabled the settings to avoid stale HDFS nodes.
> The cluster I tested this on was relatively small - 10 computers.
> I tuned the ZooKeeper sessions to keep the heartbeat at 5 seconds for
> the moment, and plan to decrease this value.
> At this point dfs.heartbeat.interval is set at the default 3 seconds, but
> I also plan to decrease this and perform a more intensive test.
> (Decreasing the times is based on experience with our current system,
> which is configured at 1.2 seconds and didn't have any issues even under
> heavy loads; obviously stop-the-world GC times should be smaller than the
> heartbeat interval.)
> I also remember I made some changes to the reconnect intervals of the
> client to allow it to reconnect to the region as fast as possible.
> I am at an early stage of experimenting with HBase, but there are a lot
> of things to test/check...
>
>
> On Tue, Apr 15, 2014 at 11:03 PM, Vladimir Rodionov wrote:
>
> > *We also had a global HDFS file limit to contend with*
> >
> > Yes, we have been seeing this from time to time in our production
> > clusters.
> > Periodic purging of old files helps, but the issue is obvious.
> >
> > -Vladimir Rodionov
> >
> >
> > On Tue, Apr 15, 2014 at 11:58 AM, Stack wrote:
> >
> > > On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu wrote:
> > >
> > > > ....
> > > > After some tuning I managed to
> > > > reduce it to 8 seconds in total and for the moment it fits the needs.
> > >
> > > What did you do Claudiu to get the time down?
> > > Thanks,
> > > St.Ack

--
// Jonathan Hsieh (shay)
// HBase Tech Lead, Software Engineer, Cloudera
// jon@cloudera.com // @jmhsieh
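
For reference, the tuning described earlier in the thread maps onto
standard HBase/HDFS configuration properties. A minimal sketch, assuming
the usual property names for these knobs and using the values from the
discussion purely as illustrations - the dfs.* settings really belong in
hdfs-site.xml on the cluster side and are only gathered here to list the
knobs in one place:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class MttrTuningSketch {
      public static void main(String[] args) {
          Configuration conf = HBaseConfiguration.create();

          // ZooKeeper session timeout (ms): bounds how long a dead region
          // server can go unnoticed; 5s mirrors the value discussed above.
          conf.setInt("zookeeper.session.timeout", 5000);

          // Smaller HDFS blocks (64 MB) as mentioned above (hdfs-site.xml).
          conf.setLong("dfs.blocksize", 64L * 1024 * 1024);

          // Mark datanodes stale quickly and avoid them for reads and
          // writes (hdfs-site.xml).
          conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
          conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);

          // Datanode heartbeat interval in seconds, default 3
          // (hdfs-site.xml); a candidate for lowering, keeping GC pauses
          // shorter than this interval.
          conf.setLong("dfs.heartbeat.interval", 3);

          // Client-side retry pacing so a client re-finds a reassigned
          // region quickly after a crash.
          conf.setLong("hbase.client.pause", 100);
          conf.setInt("hbase.client.retries.number", 35);
      }
  }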