From: Enis Söztutar
Date: Tue, 3 Dec 2013 14:18:33 -0800
Subject: Re: [Shadow Regions / Read Replicas ]
To: "dev@hbase.apache.org"

On Tue, Dec 3, 2013 at 12:31 PM, Vladimir Rodionov wrote:

> The downside:
>
> - Double/Triple memstore usage
> - Increased block cache usage (effectively, block cache will have 50%
>   capacity may be less)

These are covered in the tradeoff section of the design doc.

> These downsides are pretty serious ones. This will result:
>
> 1. in decreased overall performance due to decreased efficient block
>    cache size

You can elect not to fill up the block cache for secondary reads. It will
be a configuration option, and a tradeoff you may or may not want to pay.
Details are in the doc.

> 2. In more frequent memstore flushes - this will affect compaction and
>    write tput.

More frequent flushes are not needed unless you are using the region
snapshots approach and want to bound the lag better. It is a tradeoff
between expected lag and more write amplification.

> I do not believe that HBase 'large' MTTR does not allow to meet 99% SLA
> of 10-20ms unless your RSs go down 2-3 times a day for several minutes
> each time. You have to analyze first why you are having so frequent
> failures, then fix the root source of the problem.
> It's possible to reduce 'detection' phase in MTTR process to couple
> seconds either by using external beacon process (as I suggested already)
> or by rewriting some code inside HBase and NameNode to move all data out
> from Java heap to off-heap and reducing GC-induced timeouts from 30 sec
> to 1-2 sec max. It's tough, but doable. The result: you will decrease
> MTTR by 50% at least w/o sacrificing the overall cluster performance.
>
> I think it's RS and NN large heaps and frequent s-t-w GC activities that
> prevent meeting strict SLA - not occasional server failures.

MTTR and this work are orthogonal. In a distributed system, you cannot
differentiate between a process not responding because it is down, because
it is busy, because the network is down, or whatnot. Having a couple of
seconds of detection time is unrealistic. You will end up in a very
unstable state where you will be failing servers all over the place. An
external beacon also cannot differentiate between the main process not
responding because it is busy and because it is down. What happens when
there is a temporary network partition?

> On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh wrote:
>
> > To keep the discussion focused on the design goals, I'm going to start
> > referring to enis and deveraj's eventually consistent read replicas as
> > the *read replica* design, and consistent fast read recovery mechanism
> > based on shadowing/tailing the wals as *shadow regions* or *shadow
> > memstores*. Can we agree on nomenclature?
> >
> > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar wrote:
> >
> > > Thanks Jon for bringing this to dev@.
> > >
> > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh wrote:
> > >
> > > > Fundamentally, I'd prefer focusing on making HBase "HBasier"
> > > > instead of tackling a feature that other systems architecturally
> > > > can do better (inconsistent reads). I consider consistent
> > > > reads/writes being one of HBase's defining features.
> > > > That said, I think read replicas make sense and are a nice
> > > > feature to have.
> > >
> > > Our design proposal has a specific use case goal, and hopefully we
> > > can demonstrate the benefits of having this in HBase so that even
> > > more pieces can be built on top of this. Plus I imagine this will be
> > > a widely used feature for read-only tables or bulk loaded tables. We
> > > are not proposing to rework strong consistency semantics or make
> > > major architectural changes. I think having the tables defined with
> > > a replication count, together with the proposed client API changes
> > > (Consistency definition), plugs into the HBase model rather well.
> >
> > I do think that without any recent updating mechanism, we are limiting
> > the usefulness of this feature to essentially *only* the read-only or
> > bulk load only tables. Recency if there were any edits/updates would
> > be severely lagging (by default potentially an hour) especially in
> > cases where there are only a few edits to a primarily bulk loaded
> > table. This limitation is not mentioned in the tradeoffs or
> > requirements (or a non-requirements section) and definitely should be
> > listed there.
> >
> > With the current design it might be best to have a flag on the table
> > which marks it read-only or bulk-load only so that it only gets used
> > by users when the table is in that mode? (and maybe an "escape hatch"
> > for power users).
> >
> > [snip]
> >
> > > > - I think the two goals are both worthy on their own each with
> > > > their own optimal points. We should make sure in the design that
> > > > we can support both goals.
> > >
> > > I think our proposal is consistent with your doc, and we have
> > > considered secondary region promotion in the future section. It
> > > would be good if you can review and comment on whether you see any
> > > points missing.
> >
> > I definitely will. At the moment, I think the hybrid for the
> > wals/hlogs I suggested in the other thread seems to be an optimal
> > solution considering locality. Though feasible, it is obviously more
> > complex than just one approach alone.
> >
> > > > - I want to make sure the proposed design has a path for optimal
> > > > fast-consistent read-recovery.
> > >
> > > We think that it is, but it is a secondary goal for the initial
> > > work. I don't see any reason why secondary promotion cannot be built
> > > on top of this, once the branch is in a better state.
> >
> > Based on the detail in the design doc and this statement it sounds
> > like you have a prototype branch already? Is this the case?
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // jon@cloudera.com