Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hbase.apache.org
Received-SPF: pass (athena.apache.org: domain of enis.soz@gmail.com designates
 74.125.82.45 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAAg3a2qdTRmC9Pk6Gm8G=kE4qBYXNHz64x_OaWEA6EznnpB9eg@mail.gmail.com>
References: 
 <CAAha9a3bJcwKnLtKGg6xHKDX2h-+BqS1eBSa6=xxVbF1=3+w7A@mail.gmail.com>
 <CAAha9a19kmheAhPk4aTAHUoKEnqp6BU87pM1oCK8vkMtwnhMhQ@mail.gmail.com>
 <CAMUu0w9YrbaSrOv8qb3NxgyV6M5oJVHMwc7BDv-PTO6soOXs3Q@mail.gmail.com>
 <CAAha9a3woDaNKX0PuJb6W-9HTcSyRBzJ0V2rFefRWpYO9+oZFg@mail.gmail.com>
 <CAAg3a2qdTRmC9Pk6Gm8G=kE4qBYXNHz64x_OaWEA6EznnpB9eg@mail.gmail.com>
From: =?UTF-8?Q?Enis_S=C3=B6ztutar?= <enis.soz@gmail.com>
Date: Tue, 3 Dec 2013 14:18:33 -0800
Message-ID: 
 <CAMUu0w8NJAQHTh97gLoTBZ1tOO73-nvkp+b+1VxBr+ta1Qp_pQ@mail.gmail.com>
Subject: Re: [Shadow Regions / Read Replicas ]
To: "dev@hbase.apache.org" <dev@hbase.apache.org>
Content-Type: multipart/alternative; boundary=089e013d19cc80f9ed04eca8ae36

--089e013d19cc80f9ed04eca8ae36
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Tue, Dec 3, 2013 at 12:31 PM, Vladimir Rodionov
<vladrodionov@gmail.com> wrote:

> The downside:
>
> - Double/Triple memstore usage
> - Increased block cache usage (effectively, block cache will have 50%
> capacity may be less)


These are covered in the tradeoffs section of the design doc.


>
>
> These downsides are pretty serious ones. This will result:
>
> 1. in decreased overall performance due to decreased efficient block cache
> size
>

You can elect not to fill up the block cache for secondary reads. It will
be a configuration option, and a tradeoff you may or may not want to make.
Details are in the doc.


>  2. In more frequent memstore flushes - this will affect compaction and
> write tput.
>

More frequent flushes are not needed unless you are using the region
snapshots approach and want to bound the lag better. It is a tradeoff
between expected lag and more write amplification.


>
> I do not believe that  HBase 'large' MTTR does not allow to meet 99% SLA.
> of 10-20ms unless your RSs go down 2-3 times a day for several minutes each
> time. You have to analyze first why are you having so frequent failures,
> than fix the root source of the problem. Its possible to reduce 'detection'
> phase in MTTR process to couple seconds either by using external beacon
> process (as I suggested already) or by rewriting some code inside HBase and
> NameNode to move all data out from Java heap to off-heap and reducing
> GC-induced timeouts from 30 sec to 1-2 sec max. Its tough, but doable. The
> result: you will decrease MTTR by 50% at least w/o sacrificing the overall
> cluster performance.
>
> I think, its RS and NN large heaps   and frequent s-t-w GC  activities
> prevents meeting strict SLA - not occasional server failures.
>

MTTR and this work are orthogonal. In a distributed system, you cannot
differentiate between a process not responding because it is down, because
it is busy, because the network is down, or whatnot. Having a detection
time of a couple of seconds is unrealistic. You will end up in a very
unstable state where you will be failing servers all over the place. An
external beacon also cannot tell whether the main process is not
responding because it is busy or because it is down. And what happens
when there is a temporary network partition?
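
As a rough illustration of the point about detection time (a toy sketch,
not HBase code: the Server class, the heartbeat timeline and the timeout
values are all made up for this example), a timeout-based detector only
observes silence. Three seconds after the last heartbeat, a crashed server
and one stuck in a long pause look identical:

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    heartbeats: list  # times (in seconds) at which a heartbeat was received


def suspected_dead(server, now, timeout):
    """Return True if no heartbeat arrived within `timeout` seconds of `now`.

    The detector sees only silence; it has no way to know the cause.
    """
    seen = [t for t in server.heartbeats if t <= now]
    last = max(seen, default=None)
    return last is None or (now - last) > timeout


# Hypothetical timeline: both servers heartbeat every second until t=10.
crashed = Server("crashed", heartbeats=list(range(0, 11)))
# This one is alive but silent for 5 seconds (say, a long GC), then resumes.
paused = Server("gc-paused", heartbeats=list(range(0, 11)) + list(range(16, 30)))


def verdicts(timeout, now=13):
    """Detector output for both servers at time `now`."""
    return {s.name: suspected_dead(s, now, timeout) for s in (crashed, paused)}


aggressive = verdicts(timeout=2)     # both servers flagged as dead
conservative = verdicts(timeout=30)  # neither server flagged yet
```

With the 2 second timeout both servers are declared dead at t=13, even
though the paused one resumes heartbeating at t=16. With the 30 second
timeout neither is flagged yet. The timeout trades detection latency
against false positives; it does not remove the ambiguity.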


>
>
>
> On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jon@cloudera.com> wrote:
>
> > To keep the discussion focused on the design goals, I'm going start
> > referring to enis and deveraj's eventually consistent read replicas as the
> > *read replica* design, and consistent fast read recovery mechanism based on
> > shadowing/tailing the wals as *shadow regions* or *shadow memstores*.  Can
> > we agree on nomenclature?
> >
> >
> > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <enis@apache.org> wrote:
> >
> > > Thanks Jon for bringing this to dev@.
> > >
> > >
> > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jon@cloudera.com> wrote:
> > >
> > > > Fundamentally, I'd prefer focusing on making HBase "HBasier" instead of
> > > > tackling a feature that other systems architecturally can do better
> > > > (inconsistent reads).   I consider consistent reads/writes being one of
> > > > HBase's defining features. That said, I think read replicas makes sense
> > > > and is a nice feature to have.
> > > >
> > >
> > > Our design proposal has a specific use case goal, and hopefully we can
> > > demonstrate the benefits of having this in HBase so that even more pieces
> > > can be built on top of this. Plus I imagine this will be a widely used
> > > feature for read-only tables or bulk loaded tables. We are not proposing
> > > of reworking strong consistency semantics or major architectural changes.
> > > I think by having the tables to be defined with replication count, and
> > > the proposed client API changes (Consistency definition) plugs well into
> > > the HBase model rather well.
> > >
> > >
> > I do agree think that without any recent updating mechanism, we are
> > limiting this usefulness of this feature to essentially *only* the
> > read-only or bulk load only tables.  Recency if there were any
> > edits/updates would be severely lagging (by default potentially an hour)
> > especially in cases where there are only a few edits to a primarily bulk
> > loaded table.  This limitation is not mentioned in the tradeoffs or
> > requirements (or a non-requirements section) definitely should be listed
> > there.
> >
> > With the current design it might be best to have a flag on the table which
> > marks it read-only or bulk-load only so that it only gets used by users
> > when the table is in that mode?  (and maybe an "escape hatch" for power
> > users).
> >
> > [snip]
> > >
> > > > - I think the two goals are both worthy on their own each with their own
> > > > optimal points.  We should in the design makes sure that we can support
> > > > both goals.
> > > >
> > >
> > > I think our proposal is consistent with your doc, and we have considered
> > > secondary region promotion in the future section. It would be good if you
> > > can review and comment on whether you see any points missing.
> > >
> > >
> > I definitely will. At the moment, I think the hybrid for the wals/hlogs I
> > suggested in the other thread seems to be an optimal solution considering
> > locality.  Though feasible is obviously more complex than just one approach
> > alone.
> >
> >
> > > > - I want to making sure the proposed design have a path for optimal
> > > > fast-consistent read-recovery.
> > > >
> > >
> > > We think that it is, but it is a secondary goal for the initial work. I
> > > don't see any reason why secondary promotion cannot be build on top of
> > > this, once the branch is in a better state.
> > >
> >
> > Based on the detail in the design doc and this statement it sounds like you
> > have a prototype branch already?  Is this the case?
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // jon@cloudera.com
> >
>

--089e013d19cc80f9ed04eca8ae36--