hbase-dev mailing list archives

From Enis Söztutar <enis....@gmail.com>
Subject Re: [Shadow Regions / Read Replicas ]
Date Wed, 04 Dec 2013 22:23:12 GMT
On Wed, Dec 4, 2013 at 12:25 PM, Jimmy Xiang <jxiang@cloudera.com> wrote:

> I am concerned about reading stale data. I understand some people may want
> this feature. One of the reasons is region availability. If we make sure
> those regions are always available, we don't have to compromise, right?
> How about we support something like a region pipeline? For each important
> region, we assign it to two/three region servers and make sure all writes
> go to all three region instances, and just one of them persists data to
> the hlog, or each region instance has its own local hlog (on local fs, not
> hdfs). Is this too complex to consider, or is the write overhead too high?
>

It is not that simple. In a pipeline model, you can only read from the
primary, since only that node knows what is committed and what is not.
HDFS pipelines work when reading from other replicas, even while the
pipeline is still open, because the data is immutable. The length of the
block is established when the block replica ACKs it. In HDFS's case the
pipeline is like an append-only WAL with the length as the transaction id.
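
A toy sketch of that idea, under my own simplifying assumptions (this is
not HDFS code, and the class and method names are made up for
illustration): a replica buffers appended bytes, and readers may only see
bytes below the acknowledged length. Because bytes below that length never
change, any replica can serve them safely.

```python
class AppendOnlyReplica:
    """Hypothetical replica of an append-only block (illustration only)."""

    def __init__(self):
        self._data = bytearray()  # everything received so far
        self.acked_length = 0     # acts like a transaction id for readers

    def receive(self, chunk: bytes) -> None:
        # Bytes are only ever appended, never rewritten in place.
        self._data.extend(chunk)

    def ack(self) -> int:
        # Acknowledging makes everything received so far visible.
        self.acked_length = len(self._data)
        return self.acked_length

    def read(self, offset: int, length: int) -> bytes:
        # Readers never see bytes beyond the acknowledged length.
        end = min(offset + length, self.acked_length)
        return bytes(self._data[offset:end])


replica = AppendOnlyReplica()
replica.receive(b"edit-1;")
before_ack = replica.read(0, 100)  # nothing is visible yet
replica.ack()
after_ack = replica.read(0, 100)   # the acknowledged prefix is visible
```

A mutable region does not have this property, since a later write can
change the answer for a row that a replica has already served.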

In a pipelined synchronous replication style (like ZAB or Raft) you still
have to read from the primary to get consistent reads, because the
followers do not learn about a commit until after the leader commits it
and sends the commit message.
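
To make the lag concrete, here is a heavily simplified sketch (not ZAB or
Raft themselves: no terms, elections, or failures, and all names are
hypothetical). The leader commits once a majority holds the entry, but a
follower only applies it when the separate commit message arrives, so a
follower read in between returns stale data.

```python
class Node:
    """Hypothetical replica with a log and a commit index (illustration only)."""

    def __init__(self):
        self.log = []          # replicated (key, value) entries
        self.commit_index = 0  # number of entries known to be committed
        self.state = {}        # state built from committed entries only

    def append(self, key, value):
        self.log.append((key, value))

    def advance_commit(self, index):
        # Apply newly committed entries to the readable state.
        for key, value in self.log[self.commit_index:index]:
            self.state[key] = value
        self.commit_index = index

    def read(self, key):
        return self.state.get(key)


def replicate_write(leader, followers, key, value):
    """Leader appends, followers append and ack, leader commits on a majority."""
    leader.append(key, value)
    acks = 1  # the leader counts itself
    for follower in followers:
        follower.append(key, value)
        acks += 1
    if acks > (len(followers) + 1) // 2:
        leader.advance_commit(len(leader.log))


def send_commit(leader, followers):
    """Separate step: only now do followers learn what is committed."""
    for follower in followers:
        follower.advance_commit(leader.commit_index)


leader, f1, f2 = Node(), Node(), Node()
replicate_write(leader, [f1, f2], "row1", "v1")
stale = f1.read("row1")    # follower has the entry but not the commit
send_commit(leader, [f1, f2])
fresh = f1.read("row1")    # follower has now applied the entry
```

In this sketch the window between replicate_write and send_commit is
exactly where a read served by a follower would be inconsistent with the
leader.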

I think paxos-style quorum reads might be able to decide what is committed
and what is not, and so could provide strong consistency, but I am still
not sure about the exact details of a practical system.


>
> On Tue, Dec 3, 2013 at 10:20 PM, Devaraj Das <ddas@hortonworks.com> wrote:
>
> > On Tue, Dec 3, 2013 at 6:47 PM, Jonathan Hsieh <jon@cloudera.com> wrote:
> >
> > > > On Tue, Dec 3, 2013 at 2:04 PM, Enis Söztutar <enis.soz@gmail.com> wrote:
> > > >
> > > > > On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jon@cloudera.com> wrote:
> > > > >
> > > > > > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <enis@apache.org> wrote:
> > > > > >
> > > > > > > Thanks Jon for bringing this to dev@.
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jon@cloudera.com> wrote:
> > > > > >
> > > > > > > > Fundamentally, I'd prefer focusing on making HBase "HBasier"
> > > > > > > > instead of tackling a feature that other systems
> > > > > > > > architecturally can do better (inconsistent reads).  I consider
> > > > > > > > consistent reads/writes to be one of HBase's defining features.
> > > > > > > > That said, I think read replicas make sense and are a nice
> > > > > > > > feature to have.
> > > > > > >
> > > > > >
> > > > > > > Our design proposal has a specific use case goal, and hopefully we
> > > > > > > can demonstrate the benefits of having this in HBase so that even
> > > > > > > more pieces can be built on top of this. Plus I imagine this will
> > > > > > > be a widely used feature for read-only tables or bulk loaded
> > > > > > > tables. We are not proposing to rework strong consistency
> > > > > > > semantics or to make major architectural changes. I think that
> > > > > > > having tables defined with a replication count, together with the
> > > > > > > proposed client API changes (Consistency definition), plugs into
> > > > > > > the HBase model rather well.
> > > > > >
> > > > > >
> > > > > > I do agree, and I think that without any recent updating mechanism
> > > > > > we are limiting the usefulness of this feature to essentially
> > > > > > *only* read-only or bulk-load-only tables.  If there were any
> > > > > > edits/updates, recency would lag severely (by default potentially
> > > > > > an hour), especially in cases where there are only a few edits to a
> > > > > > primarily bulk loaded table.  This limitation is not mentioned in
> > > > > > the tradeoffs or requirements (or a non-requirements section) and
> > > > > > definitely should be listed there.
> > > > >
> > > >
> > > > > Obviously the amount of lag you would observe depends on whether you
> > > > > are using "Region snapshots", "WAL-Tailing" or "Async wal
> > > > > replication". I think there are still use cases where you can live
> > > > > with >1 hour old stale reads, so that "Region snapshots" is not
> > > > > *just* for read-only tables. I'll add these to the tradeoffs section.
> > > >
> > >
> > > > Thanks for adding it there -- I really think it is a big headline
> > > > caveat on my expectation of "eventual consistency".  Other systems out
> > > > there give you eventual consistency on the millisecond level for most
> > > > cases, while in this initial implementation "eventual" would mean 10's
> > > > of minutes or even handfuls of minutes behind (with the snapshots flush
> > > > mechanism)!
> > >
> > >
> > But that's just how the implementation is broken up currently. When WAL
> > tailing is implemented, we will be close, maybe, in the order of seconds
> > behind.
> >
> >
> > > There are a handful of other things in the phase one part of the
> > > implementation section that limit the usefulness of the feature to a
> > > certain kind of constrained hbase user.  I'll start another thread for
> > > those.
> > >
> > >
> > Cool. The one thing I just realized is that we might have some additional
> > work to handle security issues for the shadow regions.
> >
> >
> > >
> > > >
> > > > > We are proposing to implement "Region snapshots" first and "Async wal
> > > > > replication" second. As argued, I think wal-tailing only makes sense
> > > > > with WALpr, so that work is left until after we have WAL per region.
> > > >
> > > >
> > > > This is our main disagreement -- I'm not convinced that wal tailing
> > > > only makes sense for the wal per region hlog implementation.  Instead
> > > > of bouncing around hypotheticals, it sounds like I'll be doing more
> > > > experiments to prove it to myself and to convince you. :)
> > >
> > >
> > >
> > > Thanks :-) The async WAL replication approach outlined in the doc does
> > > not require WALpr and also has the advantage that the source itself can
> > > direct the edits to the specific other regionservers hosting the replicas
> > > in question.
> >
> >
> > > >
> > > > >
> > > > > > With the current design it might be best to have a flag on the
> > > > > > table which marks it read-only or bulk-load only so that it only
> > > > > > gets used by users when the table is in that mode?  (and maybe an
> > > > > > "escape hatch" for power users).
> > > > >
> > > >
> > > > > I think we have a read-only flag already. We might not have a
> > > > > bulk-load only flag though. It makes sense to add it if we want to
> > > > > allow bulk loads but prevent writes.
> > > >
> > > > Great.
> > >
> > > >
> > > > >
> > > > > [snip]
> > > > > >
> > > > > > > > - I think the two goals are both worthy on their own, each with
> > > > > > > > their own optimal points.  We should make sure in the design
> > > > > > > > that we can support both goals.
> > > > > > >
> > > > > >
> > > > > > > I think our proposal is consistent with your doc, and we have
> > > > > > > considered secondary region promotion in the future section. It
> > > > > > > would be good if you can review and comment on whether you see any
> > > > > > > points missing.
> > > > > >
> > > > > >
> > > > > > I definitely will. At the moment, I think the hybrid for the
> > > > > > wals/hlogs I suggested in the other thread seems to be an optimal
> > > > > > solution considering locality.  Though feasible, it is obviously
> > > > > > more complex than just one approach alone.
> > > > >
> > > > >
> > > > > > > > - I want to make sure the proposed design has a path for optimal
> > > > > > > > fast-consistent read-recovery.
> > > > > > >
> > > > > >
> > > > > > > We think that it does, but it is a secondary goal for the initial
> > > > > > > work. I don't see any reason why secondary promotion cannot be
> > > > > > > built on top of this, once the branch is in a better state.
> > > > > >
> > > > >
> > > > > > Based on the detail in the design doc and this statement it sounds
> > > > > > like you have a prototype branch already?  Is this the case?
> > > > >
> > > >
> > > > > Indeed. I think that is mentioned in the jira description. We have
> > > > > some parts of the changes for region, region server, HRI, and master.
> > > > > Client changes are on the way. I think we can post that in a github
> > > > > branch for now to share the code early and solicit early reviews.
> > > >
> > > > I think that would be great.  Back when we did snapshots, we had
> > > > active development against a prototype and spent a bit of time breaking
> > > > it down into manageable, more polished pieces that had slightly lenient
> > > > reviews.  This exercise really helped us with our interfaces.  We
> > > > committed code to the dev branch, which limited merge pains and diffs
> > > > for modifications made by different contributors.  In the end, when we
> > > > had something we were happy with on the dev branch, we merged with
> > > > trunk and fixed bugs/diffs that cropped up in the meantime.  I'd
> > > > suggest a similar process for this.
> > >
> > >
> > > --
> > > // Jonathan Hsieh (shay)
> > > // Software Engineer, Cloudera
> > > // jon@cloudera.com
> > >
> >
> > --
> > CONFIDENTIALITY NOTICE
> > NOTICE: This message is intended for the use of the individual or entity
> to
> > which it is addressed and may contain information that is confidential,
> > privileged and exempt from disclosure under applicable law. If the reader
> > of this message is not the intended recipient, you are hereby notified
> that
> > any printing, copying, dissemination, distribution, disclosure or
> > forwarding of this communication is strictly prohibited. If you have
> > received this communication in error, please contact the sender
> immediately
> > and delete it from your system. Thank You.
> >
>
