hbase-dev mailing list archives

From Stack <st...@duboce.net>
Subject Re: [DISCUSSION] Merge Backup / Restore - Branch HBASE-7912
Date Mon, 12 Sep 2016 19:01:05 GMT
Late to the game. A few comments after rereading this thread as a 'user'.

+ Before merge, a user-facing feature like this should work. (If this is a
"higher bar for new features", bring it on -- smile.)
+ As a user, I tried the branch with the tools after reviewing the
just-posted doc. I had an 'interesting' experience (I left comments up on
the issue). I think the tooling and doc are important to get right. If they
break easily or are inconsistent (or lack 'polish'), operators will judge
the whole backup/restore tooling chain as untrustworthy and abandon it.
Let's not have this happen to this feature.
+ Matteo's suggestion (with a helpful starter list) that there needs to be
explicit qualification of what is actually being delivered -- including a
listing of limitations (some look serious, such as data bleed from other
regions in WALs, but maybe I don't care for my use case...) -- needs to
accompany the merge. Let's fold them into the user doc in the technical
overview area, as suggested, so user expectations are properly managed
(otherwise, they expect the world and will just give up when we fall
short). Vladimir did a list of what is in each of the phases above, which
would serve as a good start.
+ Is this feature 'experimental' (Matteo asks above)? I'd prefer it not be.
If it is, it should be labelled as such all over. I see the current state
called out as a '... technical preview feature'. Does this mean
not-for-users?

St.Ack

On Mon, Sep 12, 2016 at 8:03 AM, Ted Yu <yuzhihong@gmail.com> wrote:

> Sean:
> Do you have more comments ?
>
> Cheers
>
> On Fri, Sep 9, 2016 at 1:42 PM, Vladimir Rodionov <vladrodionov@gmail.com> wrote:
>
> > Sean,
> >
> > Backup/Restore can fail for various reasons: a network outage (cluster
> > wide), various time-outs in the HBase and HDFS layers, an M/R failure due
> > to "HDFS exceeded quota", user error (manual deletion of data), and so on.
> > It is impossible to enumerate all possible types of failures in a
> > distributed system - that is not our goal/task.
> >
> > We focus completely on backup system table consistency in the presence of
> > any type of failure. That is what I call "tolerance to failures".
> >
> > On a failure:
> >
> > BACKUP. All backup system information (prior to backup) will be restored
> > and all temporary data in HDFS related to the failed session will be
> > deleted.
> > RESTORE. We do not care about system data, because restore does not
> > change it. Temporary data in HDFS will be cleaned up and the table will be
> > back in the state it was in before the operation started.
> >
> > This is what the user should expect in case of a failure.
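
Vladimir's failure contract above (on a failed backup, put the backup system information back the way it was and delete the session's temporary HDFS data; on a failed restore, leave the target table as it was) is a snapshot-and-rollback pattern. Below is a minimal Python sketch of that contract under an in-memory model. `BackupSystemState`, `run_backup`, `run_restore`, the `/tmp/backup/<id>` staging path and the step callables are hypothetical stand-ins for the backup system table, the HDFS temporary data and the M/R stages; they are not HBase APIs.

```python
# Hypothetical in-memory model of the failure contract, not HBase code.
import copy
from dataclasses import dataclass, field


class BackupFailed(Exception):
    """Raised by a simulated backup/restore step, e.g. an HDFS quota error."""


@dataclass
class BackupSystemState:
    # Hypothetical stand-in for the backup system table: backup id -> status.
    sessions: dict = field(default_factory=dict)
    # Hypothetical stand-in for temporary HDFS dirs of in-flight sessions.
    temp_dirs: set = field(default_factory=set)


def run_backup(state, backup_id, steps):
    """Run backup steps; on any failure, roll back system info and temp data."""
    saved_sessions = copy.deepcopy(state.sessions)  # system info prior to backup
    temp_dir = f"/tmp/backup/{backup_id}"  # hypothetical staging path
    state.temp_dirs.add(temp_dir)
    state.sessions[backup_id] = "RUNNING"
    try:
        for step in steps:
            step(state)
    except Exception:
        state.sessions = saved_sessions  # restore pre-backup system info
        state.temp_dirs.discard(temp_dir)  # delete the failed session's temp data
        raise
    state.sessions[backup_id] = "COMPLETE"
    state.temp_dirs.discard(temp_dir)  # temp data no longer needed after commit


def run_restore(tables, table_name, image, steps):
    """Build the restored table off to the side; publish only on success.

    Restore never touches the backup system state, matching "restore does
    not change it".
    """
    staging = dict(image)  # temporary copy, dropped if a step raises
    for step in steps:
        step(staging)
    # Reached only if every step succeeded, so a failure leaves `tables` as-is.
    tables[table_name] = staging
```

A correctness-in-face-of-failure test in this model injects a step that raises partway through and then asserts that `state.sessions`, `state.temp_dirs` and the target table are exactly what they were before the call.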
> >
> > -Vlad
> >
> > On Fri, Sep 9, 2016 at 12:56 PM, Sean Busbey <busbey@apache.org> wrote:
> >
> > > Failing in a consistent way, with docs that explain the various
> > > expected failures would be sufficient.
> > >
> > > On Fri, Sep 9, 2016 at 12:16 PM, Vladimir Rodionov
> > > <vladrodionov@gmail.com> wrote:
> > > > Do not worry, Sean, the doc is coming today as a preview, and our
> > > > writer Frank will be working on putting it into the Apache repo. The
> > > > timeline depends on Frank's schedule, but I hope we will get it sooner
> > > > rather than later.
> > > >
> > > > As for failure testing, we are focusing only on a consistent state of
> > > > backup system data in the presence of any type of failure. We are not
> > > > going to implement anything more "fancy" than that. We allow both
> > > > backup and restore to fail. What we do not allow is to have system
> > > > data corrupted. Will that suffice for you? Do you have any other
> > > > concerns you want us to address?
> > > >
> > > > -Vlad
> > > >
> > > >
> > > > On Fri, Sep 9, 2016 at 10:56 AM, Sean Busbey <busbey@apache.org> wrote:
> > > >
> > > >> "docs will come to Apache soon" does not address my concern around
> > > >> docs at all, unless said docs have already made it into the project
> > > >> repo. I don't want third party resources for using a major and
> > > >> important feature of the project; I want us to provide end users with
> > > >> what they need to get the job done.
> > > >>
> > > >> I see some calls for patience on the failure testing, but the appeal
> > > >> to us having done a bad job of requiring proper tests of previous
> > > >> features just makes me more concerned about not getting them here. I
> > > >> don't want to set yet another bad example that will then be pointed
> > > >> to in the future.
> > > >>
> > > >> On Sep 8, 2016 10:50, "Ted Yu" <yuzhihong@gmail.com> wrote:
> > > >>
> > > >> > Is there any concern that has not been addressed?
> > > >> >
> > > >> > Do we need another vote thread?
> > > >> >
> > > >> > Thanks
> > > >> >
> > > >> > On Thu, Sep 8, 2016 at 9:21 AM, Andrew Purtell <apurtell@apache.org> wrote:
> > > >> >
> > > >> > > Vlad,
> > > >> > >
> > > >> > > I apologize for using the term 'half-baked' in a way that could
> > > >> > > seem a description of HBASE-7912. I meant that as a general
> > > >> > > hypothetical.
> > > >> > >
> > > >> > > On Wed, Sep 7, 2016 at 9:36 AM, Vladimir Rodionov <vladrodionov@gmail.com> wrote:
> > > >> > >
> > > >> > > > >> I'm not sure that "There is already lots of half-baked code
> > > >> > > > >> in the branch, so what's the harm in adding more?"
> > > >> > > >
> > > >> > > > I meant not production-ready yet. This is the 2.0 development
> > > >> > > > branch and, hence, many features are in the works and not well
> > > >> > > > tested, etc. I do not consider backup a half-baked feature - it
> > > >> > > > has passed our internal QA and has a very good doc, which we will
> > > >> > > > provide to Apache shortly.
> > > >> > > >
> > > >> > > > -Vlad
> > > >> > > >
> > > >> > > > On Wed, Sep 7, 2016 at 9:13 AM, Andrew Purtell <apurtell@apache.org> wrote:
> > > >> > > >
> > > >> > > > > We shouldn't admit half-baked changes that won't be finished.
> > > >> > > > > However, in this case the crew working on this feature are
> > > >> > > > > long-timers and less likely than just about anyone to leave
> > > >> > > > > something in a half-baked state. Of course there is no
> > > >> > > > > guarantee how anything will turn out, but I am willing to take
> > > >> > > > > a little on faith if they feel their best path forward now is
> > > >> > > > > to merge to trunk. I only wish I had had the bandwidth to do
> > > >> > > > > some real kicking of the tires by now. Maybe this week.
> > > >> > > > >
> > > >> > > > > (Yes, I'm using some of that time for this email :-) but I
> > > >> > > > > type fast.)
> > > >> > > > >
> > > >> > > > > That said, I would like to agitate for making 2.0 more real
> > > >> > > > > and spending some time on it now that I'm winding down with
> > > >> > > > > 0.98. I think that means branching for 2.0 real soon now and
> > > >> > > > > even evicting things from the 2.0 branch that aren't finished
> > > >> > > > > or stable, leaving them only once again in the master branch.
> > > >> > > > > Or, maybe just evicting them. Let's take it case by case.
> > > >> > > > >
> > > >> > > > > I think this feature can come in relatively safely. As added
> > > >> > > > > insurance, let's admit the possibility that it could be
> > > >> > > > > reverted on the 2.0 branch if folks working on stabilizing 2.0
> > > >> > > > > decide to evict it because it is unfinished or unstable,
> > > >> > > > > because that certainly can happen. If talk like that starts, I
> > > >> > > > > would expect we'd get help finishing or stabilizing what's
> > > >> > > > > under discussion for revert. Or, we'd have a revert. Either
> > > >> > > > > way the outcome is acceptable.
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > On Wed, Sep 7, 2016 at 8:56 AM, Dima Spivak <dimaspivak@apache.org> wrote:
> > > >> > > > >
> > > >> > > > > > I'm not sure that "There is already lots of half-baked code
> > > >> > > > > > in the branch, so what's the harm in adding more?" is a good
> > > >> > > > > > code commit philosophy for a fault-tolerant distributed data
> > > >> > > > > > store. ;)
> > > >> > > > > >
> > > >> > > > > > More seriously, a lack of test coverage for existing features
> > > >> > > > > > shouldn't be used as justification for introducing new
> > > >> > > > > > features with the same shortcomings. Ultimately, it's the end
> > > >> > > > > > user who will feel the pain, so shouldn't we do everything we
> > > >> > > > > > can to mitigate that?
> > > >> > > > > >
> > > >> > > > > > -Dima
> > > >> > > > > >
> > > >> > > > > > On Wed, Sep 7, 2016 at 8:46 AM, Vladimir Rodionov <vladrodionov@gmail.com> wrote:
> > > >> > > > > >
> > > >> > > > > > > Sean,
> > > >> > > > > > >
> > > >> > > > > > > * have docs
> > > >> > > > > > >
> > > >> > > > > > > Agreed. We have a doc, and backup is the most documented
> > > >> > > > > > > feature :). We will release it to Apache shortly.
> > > >> > > > > > >
> > > >> > > > > > > * have sunny-day correctness tests
> > > >> > > > > > >
> > > >> > > > > > > The feature has close to 60 test cases, which run for
> > > >> > > > > > > approx. 30 min. We can add more, if the community does not
> > > >> > > > > > > mind :)
> > > >> > > > > > >
> > > >> > > > > > > * have correctness-in-face-of-failure tests
> > > >> > > > > > >
> > > >> > > > > > > Any examples of these tests in existing features? This is in
> > > >> > > > > > > the works; we have a clear understanding of what should be
> > > >> > > > > > > done by the time of the 2.0 release. Verifying the existing
> > > >> > > > > > > code with the IT monkey is a very near-term goal for us.
> > > >> > > > > > >
> > > >> > > > > > > * don't rely on things outside of HBase for normal operation
> > > >> > > > > > >   (okay for advanced operation)
> > > >> > > > > > >
> > > >> > > > > > > We do not.
> > > >> > > > > > >
> > > >> > > > > > > An enormous amount of time has already been spent on the
> > > >> > > > > > > development and testing of the feature; it has passed our
> > > >> > > > > > > internal tests and many rounds of code reviews by HBase
> > > >> > > > > > > committers. We do not mind if someone from the HBase
> > > >> > > > > > > community (outside of HW) reviews the code, but waiting for a
> > > >> > > > > > > volunteer will probably take forever; the feature is quite
> > > >> > > > > > > large (1MB+ cumulative patch).
> > > >> > > > > > >
> > > >> > > > > > > The 2.0 branch is full of half-baked features, most of them
> > > >> > > > > > > in active development, therefore I am not following you
> > > >> > > > > > > here, Sean. Why is HBASE-7912 not good enough yet to be
> > > >> > > > > > > integrated into the 2.0 branch?
> > > >> > > > > > >
> > > >> > > > > > > -Vlad
> > > >> > > > > > >
> > > >> > > > > > > On Wed, Sep 7, 2016 at 8:23 AM, Sean Busbey <busbey@apache.org> wrote:
> > > >> > > > > > >
> > > >> > > > > > > > On Tue, Sep 6, 2016 at 10:36 PM, Josh Elser <josh.elser@gmail.com> wrote:
> > > >> > > > > > > > > So, the answer to Sean's original question is "as
> > > >> > > > > > > > > robust as snapshots presently are"? (independence of
> > > >> > > > > > > > > backup/restore failure tolerance from snapshot failure
> > > >> > > > > > > > > tolerance)
> > > >> > > > > > > > >
> > > >> > > > > > > > > Is this just a question WRT context of the change, or
> > > >> > > > > > > > > is it meant as a veto from you, Sean? Just trying to
> > > >> > > > > > > > > make sure I'm following along adequately.
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > I'd say ATM I'm -0, bordering on -1, but not for reasons
> > > >> > > > > > > > I can articulate well.
> > > >> > > > > > > >
> > > >> > > > > > > > Here's an attempt.
> > > >> > > > > > > >
> > > >> > > > > > > > We've been trying to move, as a community, towards
> > > >> > > > > > > > minimizing risk to downstream folks by getting "complete
> > > >> > > > > > > > enough for use" gates in place before we introduce new
> > > >> > > > > > > > features. This was spurred by some features getting in
> > > >> > > > > > > > half-baked and never making it to "can really use" status
> > > >> > > > > > > > (I'm thinking of distributed log replay and the zk-less
> > > >> > > > > > > > assignment stuff; I don't recall if there was more).
> > > >> > > > > > > >
> > > >> > > > > > > > The gates, generally, included things like:
> > > >> > > > > > > >
> > > >> > > > > > > > * have docs
> > > >> > > > > > > > * have sunny-day correctness tests
> > > >> > > > > > > > * have correctness-in-face-of-failure tests
> > > >> > > > > > > > * don't rely on things outside of HBase for normal
> > > >> > > > > > > >   operation (okay for advanced operation)
> > > >> > > > > > > >
> > > >> > > > > > > > As an example, we kept the MOB work off in a branch and
> > > >> > > > > > > > out of master until it could pass these criteria. The big
> > > >> > > > > > > > exemption we've had to this was the hbase-spark
> > > >> > > > > > > > integration, where we all agreed it could land in master
> > > >> > > > > > > > because it was very well isolated (the slide away from
> > > >> > > > > > > > including docs as a first-class part of building up that
> > > >> > > > > > > > integration has led me to doubt the wisdom of this
> > > >> > > > > > > > decision).
> > > >> > > > > > > >
> > > >> > > > > > > > We've also been treating inclusion in "probably will be
> > > >> > > > > > > > released to downstream" branches as a higher bar,
> > > >> > > > > > > > requiring:
> > > >> > > > > > > >
> > > >> > > > > > > > * don't moderately impact performance when the feature
> > > >> > > > > > > >   isn't in use
> > > >> > > > > > > > * don't severely impact performance when the feature is
> > > >> > > > > > > >   in use
> > > >> > > > > > > > * either default-to-on or show enough demand to believe a
> > > >> > > > > > > >   non-trivial number of folks will turn the feature on
> > > >> > > > > > > >
> > > >> > > > > > > > The above has kept MOB and hbase-spark integration out of
> > > >> > > > > > > > branch-1, presumably while they've "gotten more stable" in
> > > >> > > > > > > > master from the odd vendor inclusion.
> > > >> > > > > > > >
> > > >> > > > > > > > Are we going to have a 2.0 release before the end of the
> > > >> > > > > > > > year? We're coming up on 1.5 years since the release of
> > > >> > > > > > > > version 1.0; seems like it's about time, though I haven't
> > > >> > > > > > > > seen any concrete plans this year. Presuming we are going
> > > >> > > > > > > > to have one by the end of the year, it seems a bit close
> > > >> > > > > > > > to still be adding in "features that need maturing" on
> > > >> > > > > > > > the branch.
> > > >> > > > > > > >
> > > >> > > > > > > > The lack of a concrete plan for 2.0 keeps me from
> > > >> > > > > > > > considering these things blockers at the moment. But I
> > > >> > > > > > > > know first hand how much trouble folks have had with
> > > >> > > > > > > > other features that have gone into downstream-facing
> > > >> > > > > > > > releases without robustness checks (e.g. replication),
> > > >> > > > > > > > and I'm concerned about what we're setting up if 2.0 goes
> > > >> > > > > > > > out with this feature in its current state.
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > --
> > > >> > > > > Best regards,
> > > >> > > > >
> > > >> > > > >    - Andy
> > > >> > > > >
> > > >> > > > > Problems worthy of attack prove their worth by hitting back.
> > > >> > > > > - Piet Hein (via Tom White)
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
>
