hbase-dev mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: [DISCUSSION] Merge Backup / Restore - Branch HBASE-7912
Date Mon, 12 Sep 2016 19:19:17 GMT
Mega patch (rev 18) is on HBASE-14123.

Please comment on HBASE-14123 on how you want to review.

Thanks

On Mon, Sep 12, 2016 at 12:15 PM, Stack <stack@duboce.net> wrote:

> On review of the 'patch', do I just compare the branch to master or is
> there a megapatch posted somewhere (I think I saw one but it seemed stale
> and then I 'lost' the tab). Sorry for dumb question.
> St.Ack
>
> On Mon, Sep 12, 2016 at 12:01 PM, Stack <stack@duboce.net> wrote:
>
> > Late to the game. A few comments after rereading this thread as a 'user'.
> >
> > + Before merge, a user-facing feature like this should work (if this is
> > "higher-bar for new features", bring it on -- smile).
> > + As a user, I tried the branch with tools after reviewing the just-posted
> > doc. I had an 'interesting' experience (left comments up on issue). I think
> > it is important to get the tooling/doc right. If it breaks easily or is
> > inconsistent (or lacks 'polish'), operators will judge the whole
> > backup/restore tooling chain as not trustworthy and abandon it. Let's not
> > have this happen to this feature.
> > + Matteo's suggestion (with a helpful starter list) that there needs to be
> > explicit qualification on what is actually being delivered -- including a
> > listing of limitations (some look serious such as data bleed from other
> > regions in WALs, but maybe I don't care for my use case...) -- needs to
> > accompany the merge. Let's fold them into the user doc in the technical
> > overview area as suggested so user expectations are properly managed
> > (otherwise, they expect the world and will just give up when we fall
> > short). Vladimir did a list of what is in each of the phases above which
> > would serve as a good start.
> > + Is this feature 'experimental' (Matteo asks above)? I'd prefer it is
> > not. If it is, it should be labelled all over that it is so. I see the
> > current state called out as a '... technical preview feature'. Does this
> > mean not-for-users?
> >
> > St.Ack
> >
> > On Mon, Sep 12, 2016 at 8:03 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> >> Sean:
> >> Do you have more comments ?
> >>
> >> Cheers
> >>
> >> On Fri, Sep 9, 2016 at 1:42 PM, Vladimir Rodionov <vladrodionov@gmail.com>
> >> wrote:
> >>
> >> > Sean,
> >> >
> >> > Backup/Restore can fail for various reasons: network outage (cluster
> >> > wide), various time-outs in the HBase and HDFS layers, M/R failure due
> >> > to "HDFS exceeded quota", user error (manual deletion of data), and so
> >> > on. It is impossible to enumerate all possible types of failures in a
> >> > distributed system - that is not our goal/task.
> >> >
> >> > We focus completely on backup system table consistency in the presence
> >> > of any type of failure. That is what I call "tolerance to failures".
> >> >
> >> > On a failure:
> >> >
> >> > BACKUP. All backup system information (prior to backup) will be restored
> >> > and all temporary data in HDFS related to the failed session will be
> >> > deleted.
> >> > RESTORE. We do not care about system data, because restore does not
> >> > change it. Temporary data in HDFS will be cleaned up and the table will
> >> > be back in the state it was in before the operation started.
> >> >
> >> > This is what the user should expect in case of a failure.
> >> >
> >> > -Vlad
> >> >
> >> >
> >> > On Fri, Sep 9, 2016 at 12:56 PM, Sean Busbey <busbey@apache.org> wrote:
> >> >
> >> > > Failing in a consistent way, with docs that explain the various
> >> > > expected failures would be sufficient.
> >> > >
> >> > > On Fri, Sep 9, 2016 at 12:16 PM, Vladimir Rodionov
> >> > > <vladrodionov@gmail.com> wrote:
> >> > > > Do not worry Sean, the doc is coming today as a preview and our
> >> > > > writer Frank will be working on putting it into the Apache repo. The
> >> > > > timeline depends on Frank's schedule but I hope we will get it
> >> > > > sooner rather than later.
> >> > > >
> >> > > > As for failure testing, we are focusing only on a consistent state
> >> > > > of backup system data in the presence of any type of failure. We are
> >> > > > not going to implement anything more "fancy" than that. We allow
> >> > > > both backup and restore to fail. What we do not allow is to have
> >> > > > system data corrupted. Will it suffice for you? Do you have any
> >> > > > other concerns you want us to address?
> >> > > >
> >> > > > -Vlad
> >> > > >
> >> > > >
> >> > > > On Fri, Sep 9, 2016 at 10:56 AM, Sean Busbey <busbey@apache.org> wrote:
> >> > > >
> >> > > >> "docs will come to Apache soon" does not address my concern around
> >> > > >> docs at all, unless said docs have already made it into the project
> >> > > >> repo. I don't want third party resources for using a major and
> >> > > >> important feature of the project, I want us to provide end users
> >> > > >> with what they need to get the job done.
> >> > > >>
> >> > > >> I see some calls for patience on the failure testing, but the appeal
> >> > > >> to us having done a bad job of requiring proper tests of previous
> >> > > >> features just makes me more concerned about not getting them here.
> >> > > >> I don't want to set yet another bad example that will then be
> >> > > >> pointed to in the future.
> >> > > >>
> >> > > >> On Sep 8, 2016 10:50, "Ted Yu" <yuzhihong@gmail.com> wrote:
> >> > > >>
> >> > > >> > Is there any concern which is not addressed ?
> >> > > >> >
> >> > > >> > Do we need another Vote thread ?
> >> > > >> >
> >> > > >> > Thanks
> >> > > >> >
> >> > > >> > On Thu, Sep 8, 2016 at 9:21 AM, Andrew Purtell <apurtell@apache.org>
> >> > > >> > wrote:
> >> > > >> >
> >> > > >> > > Vlad,
> >> > > >> > >
> >> > > >> > > I apologize for using the term 'half-baked' in a way that could
> >> > > >> > > seem a description of HBASE-7912. I meant that as a general
> >> > > >> > > hypothetical.
> >> > > >> > >
> >> > > >> > > On Wed, Sep 7, 2016 at 9:36 AM, Vladimir Rodionov
> >> > > >> > > <vladrodionov@gmail.com> wrote:
> >> > > >> > >
> >> > > >> > > > >> I'm not sure that "There is already lots of half-baked code
> >> > > >> > > > >> in the branch, so what's the harm in adding more?"
> >> > > >> > > >
> >> > > >> > > > I meant not production-ready yet. This is the 2.0 development
> >> > > >> > > > branch and hence many features are in the works, not being
> >> > > >> > > > tested well, etc. I do not consider backup a half-baked
> >> > > >> > > > feature - it has passed our internal QA and has very good doc,
> >> > > >> > > > which we will provide to Apache shortly.
> >> > > >> > > >
> >> > > >> > > > -Vlad
> >> > > >> > > >
> >> > > >> > > > On Wed, Sep 7, 2016 at 9:13 AM, Andrew Purtell
> >> > > >> > > > <apurtell@apache.org> wrote:
> >> > > >> > > >
> >> > > >> > > > > We shouldn't admit half baked changes that won't be finished.
> >> > > >> > > > > However in this case the crew working on this feature are long
> >> > > >> > > > > timers and less likely than just about anyone to leave
> >> > > >> > > > > something in a half baked state. Of course there is no
> >> > > >> > > > > guarantee how anything will turn out, but I am willing to take
> >> > > >> > > > > a little on faith if they feel their best path forward now is
> >> > > >> > > > > to merge to trunk. I only wish I had bandwidth to have done
> >> > > >> > > > > some real kicking of the tires by now. Maybe this week.
> >> > > >> > > > >
> >> > > >> > > > > (Yes, I'm using some of that time for this email :-) but I
> >> > > >> > > > > type fast.)
> >> > > >> > > > >
> >> > > >> > > > > That said, I would like to agitate for making 2.0 more real
> >> > > >> > > > > and spend some time on it now that I'm winding down with 0.98.
> >> > > >> > > > > I think that means branching for 2.0 real soon now and even
> >> > > >> > > > > evicting things from 2.0 branch that aren't finished or
> >> > > >> > > > > stable, leaving them only once again in the master branch. Or,
> >> > > >> > > > > maybe just evicting them. Let's take it case by case.
> >> > > >> > > > >
> >> > > >> > > > > I think this feature can come in relatively safely. As added
> >> > > >> > > > > insurance, let's admit the possibility it could be reverted on
> >> > > >> > > > > the 2.0 branch if folks working on stabilizing 2.0 decide to
> >> > > >> > > > > evict it because it is unfinished or unstable, because that
> >> > > >> > > > > certainly can happen. I would expect if talk like that starts,
> >> > > >> > > > > we'd get help finishing or stabilizing what's under discussion
> >> > > >> > > > > for revert. Or, we'd have a revert. Either way the outcome is
> >> > > >> > > > > acceptable.
> >> > > >> > > > >
> >> > > >> > > > >
> >> > > >> > > > > On Wed, Sep 7, 2016 at 8:56 AM, Dima Spivak
> >> > > >> > > > > <dimaspivak@apache.org> wrote:
> >> > > >> > > > >
> >> > > >> > > > > > I'm not sure that "There is already lots of half-baked code
> >> > > >> > > > > > in the branch, so what's the harm in adding more?" is a good
> >> > > >> > > > > > code commit philosophy for a fault-tolerant distributed data
> >> > > >> > > > > > store. ;)
> >> > > >> > > > > >
> >> > > >> > > > > > More seriously, a lack of test coverage for existing
> >> > > >> > > > > > features shouldn't be used as justification for introducing
> >> > > >> > > > > > new features with the same shortcomings. Ultimately, it's
> >> > > >> > > > > > the end user who will feel the pain, so shouldn't we do
> >> > > >> > > > > > everything we can to mitigate that?
> >> > > >> > > > > >
> >> > > >> > > > > > -Dima
> >> > > >> > > > > >
> >> > > >> > > > > > On Wed, Sep 7, 2016 at 8:46 AM, Vladimir Rodionov
> >> > > >> > > > > > <vladrodionov@gmail.com> wrote:
> >> > > >> > > > > >
> >> > > >> > > > > > > Sean,
> >> > > >> > > > > > >
> >> > > >> > > > > > > * have docs
> >> > > >> > > > > > >
> >> > > >> > > > > > > Agree. We have a doc and backup is the most documented
> >> > > >> > > > > > > feature :), we will release it shortly to Apache.
> >> > > >> > > > > > >
> >> > > >> > > > > > > * have sunny-day correctness tests
> >> > > >> > > > > > >
> >> > > >> > > > > > > The feature has close to 60 test cases, which run for
> >> > > >> > > > > > > approx 30 min. We can add more, if the community does not
> >> > > >> > > > > > > mind :)
> >> > > >> > > > > > >
> >> > > >> > > > > > > * have correctness-in-face-of-failure tests
> >> > > >> > > > > > >
> >> > > >> > > > > > > Any examples of these tests in existing features? In the
> >> > > >> > > > > > > works, we have a clear understanding of what should be
> >> > > >> > > > > > > done by the time of the 2.0 release. That is a very close
> >> > > >> > > > > > > goal for us, to verify IT monkey for existing code.
> >> > > >> > > > > > >
> >> > > >> > > > > > > * don't rely on things outside of HBase for normal
> >> > > >> > > > > > > operation (okay for advanced operation)
> >> > > >> > > > > > >
> >> > > >> > > > > > > We do not.
> >> > > >> > > > > > >
> >> > > >> > > > > > > Enormous time has been spent already on developing and
> >> > > >> > > > > > > testing the feature; it has passed our internal tests and
> >> > > >> > > > > > > many rounds of code reviews by HBase committers. We do not
> >> > > >> > > > > > > mind if someone from the HBase community (outside of HW)
> >> > > >> > > > > > > reviews the code, but it will probably take forever to
> >> > > >> > > > > > > wait for a volunteer; the feature is quite large (1MB+
> >> > > >> > > > > > > cumulative patch).
> >> > > >> > > > > > >
> >> > > >> > > > > > > The 2.0 branch is full of half baked features, most of
> >> > > >> > > > > > > them in active development, therefore I am not following
> >> > > >> > > > > > > you here, Sean. Why is HBASE-7912 not good enough yet to
> >> > > >> > > > > > > be integrated into the 2.0 branch?
> >> > > >> > > > > > >
> >> > > >> > > > > > > -Vlad
> >> > > >> > > > > > >
> >> > > >> > > > > > > On Wed, Sep 7, 2016 at 8:23 AM, Sean Busbey
> >> > > >> > > > > > > <busbey@apache.org> wrote:
> >> > > >> > > > > > >
> >> > > >> > > > > > > > On Tue, Sep 6, 2016 at 10:36 PM, Josh Elser
> >> > > >> > > > > > > > <josh.elser@gmail.com> wrote:
> >> > > >> > > > > > > > > So, the answer to Sean's original question is "as
> >> > > >> > > > > > > > > robust as snapshots presently are"? (independence of
> >> > > >> > > > > > > > > backup/restore failure tolerance from snapshot
> >> > > >> > > > > > > > > failure tolerance)
> >> > > >> > > > > > > > >
> >> > > >> > > > > > > > > Is this just a question WRT context of the change, or
> >> > > >> > > > > > > > > is it means for a veto from you, Sean? Just trying to
> >> > > >> > > > > > > > > make sure I'm following along adequately.
> >> > > >> > > > > > > > >
> >> > > >> > > > > > > > >
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > I'd say ATM I'm -0, bordering on -1 but not for reasons
> >> > > >> > > > > > > > I can articulate well.
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > Here's an attempt.
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > We've been trying to move, as a community, towards
> >> > > >> > > > > > > > minimizing risk to downstream folks by getting
> >> > > >> > > > > > > > "complete enough for use" gates in place before we
> >> > > >> > > > > > > > introduce new features. This was spurred by some
> >> > > >> > > > > > > > features getting in half-baked and never making it to
> >> > > >> > > > > > > > "can really use" status (I'm thinking of distributed
> >> > > >> > > > > > > > log replay and the zk-less assignment stuff, I don't
> >> > > >> > > > > > > > recall if there was more).
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > The gates, generally, included things like:
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > * have docs
> >> > > >> > > > > > > > * have sunny-day correctness tests
> >> > > >> > > > > > > > * have correctness-in-face-of-failure tests
> >> > > >> > > > > > > > * don't rely on things outside of HBase for normal
> >> > > >> > > > > > > > operation (okay for advanced operation)
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > As an example, we kept the MOB work off in a branch and
> >> > > >> > > > > > > > out of master until it could pass these criteria. The
> >> > > >> > > > > > > > big exemption we've had to this was the hbase-spark
> >> > > >> > > > > > > > integration, where we all agreed it could land in
> >> > > >> > > > > > > > master because it was very well isolated (the slide
> >> > > >> > > > > > > > away from including docs as a first-class part of
> >> > > >> > > > > > > > building up that integration has led me to doubt the
> >> > > >> > > > > > > > wisdom of this decision).
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > We've also been treating inclusion in "probably will be
> >> > > >> > > > > > > > released to downstream" branches as a higher bar,
> >> > > >> > > > > > > > requiring
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > * don't moderately impact performance when the feature
> >> > > >> > > > > > > > isn't in use
> >> > > >> > > > > > > > * don't severely impact performance when the feature is
> >> > > >> > > > > > > > in use
> >> > > >> > > > > > > > * either default-to-on or show enough demand to believe
> >> > > >> > > > > > > > a non-trivial number of folks will turn the feature on
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > The above has kept MOB and hbase-spark integration out
> >> > > >> > > > > > > > of branch-1, presumably while they've "gotten more
> >> > > >> > > > > > > > stable" in master from the odd vendor inclusion.
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > Are we going to have a 2.0 release before the end of
> >> > > >> > > > > > > > the year? We're coming up on 1.5 years since the
> >> > > >> > > > > > > > release of version 1.0; seems like it's about time,
> >> > > >> > > > > > > > though I haven't seen any concrete plans this year.
> >> > > >> > > > > > > > Presuming we are going to have one by the end of the
> >> > > >> > > > > > > > year, it seems a bit close to still be adding in
> >> > > >> > > > > > > > "features that need maturing" on the branch.
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > The lack of a concrete plan for 2.0 keeps me from
> >> > > >> > > > > > > > considering these things blockers at the moment. But I
> >> > > >> > > > > > > > know first hand how much trouble folks have had with
> >> > > >> > > > > > > > other features that have gone into downstream facing
> >> > > >> > > > > > > > releases without robustness checks (i.e. replication),
> >> > > >> > > > > > > > and I'm concerned about what we're setting up if 2.0
> >> > > >> > > > > > > > goes out with this feature in its current state.
> >> > > >> > > > > > > >
> >> > > >> > > > > > >
> >> > > >> > > > > >
> >> > > >> > > > >
> >> > > >> > > > >
> >> > > >> > > > >
> >> > > >> > > > > --
> >> > > >> > > > > Best regards,
> >> > > >> > > > >
> >> > > >> > > > >    - Andy
> >> > > >> > > > >
> >> > > >> > > > > Problems worthy of attack prove their worth by hitting back.
> >> > > >> > > > > - Piet Hein (via Tom White)
> >> > > >> > > > >
> >> > > >> > > >
> >> > > >> > >
> >> > > >> > >
> >> > > >> > >
> >> > > >> > > --
> >> > > >> > > Best regards,
> >> > > >> > >
> >> > > >> > >    - Andy
> >> > > >> > >
> >> > > >> > > Problems worthy of attack prove their worth by hitting back.
> >> > > >> > > - Piet Hein (via Tom White)
> >> > > >> > >
> >> > > >> >
> >> > > >>
> >> > >
> >> >
> >>
> >
> >
>
