hbase-dev mailing list archives

From Vladimir Rodionov <vladrodio...@gmail.com>
Subject Re: [Shadow Regions / Read Replicas ]
Date Tue, 03 Dec 2013 22:48:21 GMT
>MTTR and this work are orthogonal. In a distributed system, you cannot
>differentiate between a process not responding because it is down, because
>it is busy, because the network is down, or whatnot. Having a couple of
>seconds of detection time is unrealistic. You will end up in a very
>unstable state where you will be failing servers all over the place. An
>external beacon also cannot differentiate between the main process not
>responding because it is busy and it being down. What happens when there
>is a temporary network partition?

Be pro-active: predict node failure (slow requests recently), detect
possible router/network issues (syslog on each node). Temporary network
partitions are bad, but they usually affect multiple servers - not just
one. Pro-activity means that the Master can disable an RS before the RS
goes down. But you are right - it's totally orthogonal to what you are
proposing here. I am just wondering: if FB claims 99.99% HBase availability
(HBaseCon 2013), maybe it is worth borrowing some of their ideas? How did
they achieve this?
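
A toy sketch of the pro-active idea above, assuming a hypothetical
drainRegionServer() hook and an arbitrary latency threshold (neither is an
existing HBase API):

    // Hypothetical watchdog: if a region server's recent request latency
    // degrades past a threshold, ask the Master to drain it before it fails.
    public class ProactiveRsWatchdog {

      private static final long P99_LATENCY_LIMIT_MS = 500; // assumed threshold

      // latencyMs: p99 request latency observed for the server recently
      public void check(String serverName, long latencyMs) {
        if (latencyMs > P99_LATENCY_LIMIT_MS) {
          // Placeholder for whatever mechanism the Master would expose
          // to move regions off a suspect server ahead of a failure.
          drainRegionServer(serverName);
        }
      }

      private void drainRegionServer(String serverName) {
        System.out.println("Draining regions from " + serverName);
      }
    }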



On Tue, Dec 3, 2013 at 2:18 PM, Enis Söztutar <enis.soz@gmail.com> wrote:

> On Tue, Dec 3, 2013 at 12:31 PM, Vladimir Rodionov
> <vladrodionov@gmail.com> wrote:
>
> > The downside:
> >
> > - Double/Triple memstore usage
> > - Increased block cache usage (effectively, the block cache will have 50%
> > capacity, maybe less)
>
>
> These are covered in the tradeoffs section of the design doc.
>
>
> >
> >
> > These downsides are pretty serious ones. This will result:
> >
> > 1. in decreased overall performance due to a decreased effective block
> > cache size
> >
>
> You can elect to not fill up the block cache for secondary reads. It will
> be a configuration option, and a tradeoff you may or may not want to pay.
> Details are in the doc.
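
The exact property is not named above, so as a rough illustration only: the
existing per-request cache-blocks flag expresses the same idea of serving a
read without populating the block cache. The client types below are from
later HBase client versions, and wiring this to secondary reads specifically
is an assumption.

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;

    public class SecondaryReadSketch {
      // Issue a read that does not populate the block cache, so replica
      // traffic does not evict blocks the primary workload depends on.
      static Result readWithoutCaching(Table table, byte[] row) throws IOException {
        Get get = new Get(row);
        get.setCacheBlocks(false); // skip block cache population for this read
        return table.get(get);
      }
    }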
>
>
> >  2. In more frequent memstore flushes - this will affect compaction and
> > write throughput.
>
> More frequent flushes are not needed unless you are using the region
> snapshots approach and want to bound the lag better. It is a tradeoff
> between expected lag and more write amplification.
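
For the region-snapshot approach, the lag bound is driven by how often the
primary flushes. A minimal sketch of turning that knob, assuming the periodic
memstore flush interval property (default roughly one hour) is the relevant
one here:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class FlushIntervalSketch {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Lowering the periodic flush interval bounds how stale a
        // snapshot-based secondary can get, at the cost of more flushes and
        // therefore more compaction / write amplification.
        conf.setLong("hbase.regionserver.optionalcacheflushinterval",
            15 * 60 * 1000L); // 15 minutes, illustrative value
        System.out.println(
            conf.get("hbase.regionserver.optionalcacheflushinterval"));
      }
    }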
>
>
> >
> > I do not believe that HBase's 'large' MTTR prevents meeting a 99% SLA of
> > 10-20ms, unless your RSs go down 2-3 times a day for several minutes each
> > time. You have to analyze first why you are having such frequent
> > failures, then fix the root source of the problem. It is possible to
> > reduce the 'detection' phase of the MTTR process to a couple of seconds,
> > either by using an external beacon process (as I suggested already) or by
> > rewriting some code inside HBase and the NameNode to move all data out of
> > the Java heap to off-heap and reducing GC-induced timeouts from 30 sec to
> > 1-2 sec max. It's tough, but doable. The result: you will decrease MTTR
> > by at least 50% w/o sacrificing overall cluster performance.
> >
> > I think it is the large RS and NN heaps and frequent stop-the-world GC
> > activities that prevent meeting a strict SLA - not occasional server
> > failures.
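
For context, the 'detection' phase is governed mainly by the ZooKeeper
session timeout; a minimal sketch of tightening it (the 15s value is
illustrative, and whether pushing it toward 1-2s is safe is exactly what is
debated below):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class DetectionTimeoutSketch {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // An RS is declared dead when its ZooKeeper session expires, so the
        // session timeout is the main lever on the detection phase of MTTR.
        // Shrinking it is only plausible if GC pauses stay far below the
        // timeout (e.g. by moving data off the Java heap, as suggested above).
        conf.setInt("zookeeper.session.timeout", 15000);
        System.out.println("zookeeper.session.timeout = "
            + conf.get("zookeeper.session.timeout"));
      }
    }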
> >
>
> MTTR and this work are orthogonal. In a distributed system, you cannot
> differentiate between a process not responding because it is down, because
> it is busy, because the network is down, or whatnot. Having a couple of
> seconds of detection time is unrealistic. You will end up in a very
> unstable state where you will be failing servers all over the place. An
> external beacon also cannot differentiate between the main process not
> responding because it is busy and it being down. What happens when there
> is a temporary network partition?
>
>
>
> >
> >
> >
> > On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jon@cloudera.com> wrote:
> >
> > > To keep the discussion focused on the design goals, I'm going to start
> > > referring to enis and deveraj's eventually consistent read replicas as
> > > the *read replica* design, and the consistent fast read recovery
> > > mechanism based on shadowing/tailing the WALs as *shadow regions* or
> > > *shadow memstores*.  Can we agree on nomenclature?
> > >
> > >
> > > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <enis@apache.org> wrote:
> > >
> > > > Thanks Jon for bringing this to dev@.
> > > >
> > > >
> > > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jon@cloudera.com> wrote:
> > > >
> > > > > Fundamentally, I'd prefer focusing on making HBase "HBasier" instead
> > > > > of tackling a feature that other systems architecturally can do
> > > > > better (inconsistent reads).  I consider consistent reads/writes to
> > > > > be one of HBase's defining features. That said, I think read
> > > > > replicas make sense and are a nice feature to have.
> > > > >
> > > >
> > > > Our design proposal has a specific use case goal, and hopefully we can
> > > > demonstrate the benefits of having this in HBase so that even more
> > > > pieces can be built on top of this. Plus I imagine this will be a
> > > > widely used feature for read-only tables or bulk loaded tables. We are
> > > > not proposing to rework strong consistency semantics or make major
> > > > architectural changes. I think having tables defined with a
> > > > replication count, together with the proposed client API changes (the
> > > > Consistency definition), plugs into the HBase model rather well.
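
A minimal sketch of how the per-table replication count and the Consistency
knob described above might look from the client side; the method and enum
names follow the proposal (and the shape the feature later took), not any
release current to this thread:

    import java.io.IOException;

    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Consistency;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;

    public class ReadReplicaClientSketch {
      // Table definition side: ask for N replicas of each region.
      static HTableDescriptor describe() {
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("t1"));
        desc.setRegionReplication(3); // primary plus two secondaries
        return desc;
      }

      // Read side: opt in to possibly-stale reads served by any replica.
      static Result timelineRead(Table table, byte[] row) throws IOException {
        Get get = new Get(row);
        get.setConsistency(Consistency.TIMELINE);
        Result r = table.get(get);
        if (r.isStale()) {
          System.out.println("Served by a secondary; may lag the primary.");
        }
        return r;
      }
    }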
> > > >
> > > >
> > > I do agree that without any recent-updates mechanism, we are limiting
> > > the usefulness of this feature to essentially *only* read-only or
> > > bulk-load-only tables.  Recency, if there were any edits/updates, would
> > > be severely lagging (by default potentially an hour), especially in
> > > cases where there are only a few edits to a primarily bulk loaded
> > > table.  This limitation is not mentioned in the tradeoffs or
> > > requirements (or a non-requirements section) and definitely should be
> > > listed there.
> > >
> > > With the current design it might be best to have a flag on the table
> > > which marks it read-only or bulk-load-only, so that it only gets used
> > > when the table is in that mode? (And maybe an "escape hatch" for power
> > > users.)
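
One hypothetical way such a flag could be expressed is a table attribute next
to the existing read-only flag; the attribute key below is invented for this
sketch:

    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;

    public class ReplicaModeFlagSketch {
      // Invented marker attribute gating replica reads; not a real HBase key.
      static final String REPLICA_READS_ALLOWED = "sketch.replica.reads.allowed";

      static HTableDescriptor markBulkLoadOnly(TableName name) {
        HTableDescriptor desc = new HTableDescriptor(name);
        desc.setReadOnly(true);                       // existing read-only flag
        desc.setValue(REPLICA_READS_ALLOWED, "true"); // invented gating attribute
        return desc;
      }
    }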
> > >
> > > [snip]
> > > >
> > > > > - I think the two goals are both worthy on their own, each with
> > > > > their own optimal points.  We should make sure in the design that
> > > > > we can support both goals.
> > > > >
> > > >
> > > > I think our proposal is consistent with your doc, and we have
> > > > considered secondary region promotion in the future section. It would
> > > > be good if you could review and comment on whether you see any points
> > > > missing.
> > > >
> > > >
> > > I definitely will. At the moment, I think the hybrid for the WALs/HLogs
> > > I suggested in the other thread seems to be an optimal solution
> > > considering locality.  Though feasible, it is obviously more complex
> > > than just one approach alone.
> > >
> > >
> > > > > - I want to make sure the proposed design has a path for optimal
> > > > > fast, consistent read recovery.
> > > > >
> > > >
> > > > We think that it does, but it is a secondary goal for the initial
> > > > work. I don't see any reason why secondary promotion cannot be built
> > > > on top of this, once the branch is in a better state.
> > > >
> > >
> > > Based on the detail in the design doc and this statement it sounds like
> > > you have a prototype branch already?  Is this the case?
> > >
> > > --
> > > // Jonathan Hsieh (shay)
> > > // Software Engineer, Cloudera
> > > // jon@cloudera.com
> > >
> >
>
