hbase-dev mailing list archives

From Vladimir Rodionov <vladrodio...@gmail.com>
Subject Re: [DISCUSSION] Merge Backup / Restore - Branch HBASE-7912
Date Fri, 09 Sep 2016 20:42:17 GMT
Sean,

Backup/Restore can fail for various reasons: a cluster-wide network outage,
various time-outs in the HBase and HDFS layers, M/R failure due to "HDFS
exceeded quota", user error (manual deletion of data), and so on. It
is impossible to enumerate all possible types of failures in a distributed
system, and that is not our goal.

We focus entirely on backup system table consistency in the presence of any
type of failure. That is what I call "tolerance to failures".

On a failure:

BACKUP. All backup system information will be restored to its state prior to
the backup, and all temporary data in HDFS related to the failed session
will be deleted.
RESTORE. We do not care about system data, because restore does not change
it. Temporary data in HDFS will be cleaned up and the table will be returned
to the state it was in before the operation started.

This is what a user should expect in case of a failure.
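
The backup half of this contract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the HBASE-7912 code: `run_backup_session`, `BackupFailed`, the in-memory `system_table` dict, and the local `tmp_root` directory are all hypothetical stand-ins for the real backup system table and the session's temporary data in HDFS.

```python
import copy
import shutil
import tempfile
from pathlib import Path


class BackupFailed(Exception):
    """Raised after a failed backup session has been rolled back."""


def run_backup_session(system_table: dict, tmp_root: Path, do_backup) -> None:
    """Run do_backup; on any failure, restore system_table and drop temp data.

    system_table is a hypothetical in-memory stand-in for the backup system
    table; tmp_root stands in for the parent of the session's temp dir in HDFS.
    """
    saved = copy.deepcopy(system_table)      # system data prior to the backup
    session_dir = Path(tempfile.mkdtemp(dir=tmp_root))
    try:
        do_backup(system_table, session_dir)
    except Exception as exc:
        # Any failure type is handled the same way: roll system data back...
        system_table.clear()
        system_table.update(saved)
        # ...and delete temporary data belonging to the failed session.
        shutil.rmtree(session_dir, ignore_errors=True)
        raise BackupFailed(str(exc)) from exc
```

The sketch deliberately does not try to classify failures: whatever `do_backup` raises, the caller sees a single `BackupFailed` and finds the system data and temporary storage as they were before the session started.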

-Vlad



On Fri, Sep 9, 2016 at 12:56 PM, Sean Busbey <busbey@apache.org> wrote:

> Failing in a consistent way, with docs that explain the various
> expected failures would be sufficient.
>
> On Fri, Sep 9, 2016 at 12:16 PM, Vladimir Rodionov
> <vladrodionov@gmail.com> wrote:
> > Do not worry Sean, doc is coming today as a preview and our writer Frank
> > will be working on putting it into the Apache repo. The timeline depends on
> > Frank's schedule, but I hope we will get it sooner rather than later.
> >
> > As for failure testing, we are focusing only on a consistent state of
> > backup system data in the presence of any type of failure. We are not going
> > to implement anything more "fancy" than that. We allow both backup and
> > restore to fail. What we do not allow is to have system data corrupted.
> > Will that suffice for you? Do you have any other concerns you want us to
> > address?
> >
> > -Vlad
> >
> >
> > On Fri, Sep 9, 2016 at 10:56 AM, Sean Busbey <busbey@apache.org> wrote:
> >
> >> "docs will come to Apache soon" does not address my concern around docs at
> >> all, unless said docs have already made it into the project repo. I don't
> >> want third party resources for using a major and important feature of the
> >> project, I want us to provide end users with what they need to get the job
> >> done.
> >>
> >> I see some calls for patience on the failure testing, but the appeal to us
> >> having done a bad job of requiring proper tests of previous features just
> >> makes me more concerned about not getting them here. I don't want to set
> >> yet another bad example that will then be pointed to in the future.
> >>
> >> On Sep 8, 2016 10:50, "Ted Yu" <yuzhihong@gmail.com> wrote:
> >>
> >> > Is there any concern which is not addressed ?
> >> >
> >> > Do we need another Vote thread ?
> >> >
> >> > Thanks
> >> >
> >> > On Thu, Sep 8, 2016 at 9:21 AM, Andrew Purtell <apurtell@apache.org>
> >> > wrote:
> >> >
> >> > > Vlad,
> >> > >
> >> > > I apologize for using the term 'half-baked' in a way that could seem a
> >> > > description of HBASE-7912. I meant that as a general hypothetical.
> >> > >
> >> > > On Wed, Sep 7, 2016 at 9:36 AM, Vladimir Rodionov <vladrodionov@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > >> I'm not sure that "There is already lots of half-baked code in the
> >> > > > branch, so what's the harm in adding more?"
> >> > > >
> >> > > > I meant not production-ready yet. This is the 2.0 development branch and
> >> > > > hence many features are in the works, not well tested, etc. I do not
> >> > > > consider backup a half-baked feature: it has passed our internal QA and
> >> > > > has very good docs, which we will provide to Apache shortly.
> >> > > >
> >> > > > -Vlad
> >> > > >
> >> > > > On Wed, Sep 7, 2016 at 9:13 AM, Andrew Purtell <apurtell@apache.org>
> >> > > > wrote:
> >> > > >
> >> > > > > We shouldn't admit half baked changes that won't be finished. However in
> >> > > > > this case the crew working on this feature are long timers and less likely
> >> > > > > than just about anyone to leave something in a half baked state. Of course
> >> > > > > there is no guarantee how anything will turn out, but I am willing to take
> >> > > > > a little on faith if they feel their best path forward now is to merge to
> >> > > > > trunk. I only wish I had bandwidth to have done some real kicking of the
> >> > > > > tires by now. Maybe this week.
> >> > > > >
> >> > > > > (Yes, I'm using some of that time for this email :-) but I type fast.)
> >> > > > >
> >> > > > > That said, I would like to agitate for making 2.0 more real and spend some
> >> > > > > time on it now that I'm winding down with 0.98. I think that means
> >> > > > > branching for 2.0 real soon now and even evicting things from 2.0 branch
> >> > > > > that aren't finished or stable, leaving them only once again in the master
> >> > > > > branch. Or, maybe just evicting them. Let's take it case by case.
> >> > > > >
> >> > > > > I think this feature can come in relatively safely. As added insurance,
> >> > > > > let's admit the possibility it could be reverted on the 2.0 branch if folks
> >> > > > > working on stabilizing 2.0 decide to evict it because it is unfinished or
> >> > > > > unstable, because that certainly can happen. I would expect if talk like
> >> > > > > that starts, we'd get help finishing or stabilizing what's under discussion
> >> > > > > for revert. Or, we'd have a revert. Either way the outcome is acceptable.
> >> > > > >
> >> > > > >
> >> > > > > On Wed, Sep 7, 2016 at 8:56 AM, Dima Spivak <dimaspivak@apache.org>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > I'm not sure that "There is already lots of half-baked code in the
> >> > > > > > branch, so what's the harm in adding more?" is a good code commit
> >> > > > > > philosophy for a fault-tolerant distributed data store. ;)
> >> > > > > >
> >> > > > > > More seriously, a lack of test coverage for existing features shouldn't
> >> > > > > > be used as justification for introducing new features with the same
> >> > > > > > shortcomings. Ultimately, it's the end user who will feel the pain, so
> >> > > > > > shouldn't we do everything we can to mitigate that?
> >> > > > > >
> >> > > > > > -Dima
> >> > > > > >
> >> > > > > > On Wed, Sep 7, 2016 at 8:46 AM, Vladimir Rodionov <vladrodionov@gmail.com>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Sean,
> >> > > > > > >
> >> > > > > > > * have docs
> >> > > > > > >
> >> > > > > > > Agree. We have a doc, and backup is the most documented feature :). We
> >> > > > > > > will release it to Apache shortly.
> >> > > > > > >
> >> > > > > > > * have sunny-day correctness tests
> >> > > > > > >
> >> > > > > > > The feature has close to 60 test cases, which run for approx. 30 min. We
> >> > > > > > > can add more, if the community does not mind :)
> >> > > > > > >
> >> > > > > > > * have correctness-in-face-of-failure tests
> >> > > > > > >
> >> > > > > > > Any examples of these tests in existing features? This is in the works; we
> >> > > > > > > have a clear understanding of what should be done by the time of the 2.0
> >> > > > > > > release. That is a very near-term goal for us: to verify IT monkey for
> >> > > > > > > existing code.
> >> > > > > > >
> >> > > > > > > * don't rely on things outside of HBase for normal operation (okay for
> >> > > > > > > advanced operation)
> >> > > > > > >
> >> > > > > > > We do not.
> >> > > > > > >
> >> > > > > > > Enormous time has already been spent on developing and testing the
> >> > > > > > > feature. It has passed our internal tests and many rounds of code review
> >> > > > > > > by HBase committers. We do not mind if someone from the HBase community
> >> > > > > > > (outside of HW) reviews the code, but waiting for a volunteer will
> >> > > > > > > probably take forever; the feature is quite large (1MB+ cumulative patch).
> >> > > > > > >
> >> > > > > > > The 2.0 branch is full of half-baked features, and most of them are in
> >> > > > > > > active development, so I am not following you here, Sean. Why is
> >> > > > > > > HBASE-7912 not yet good enough to be integrated into the 2.0 branch?
> >> > > > > > >
> >> > > > > > > -Vlad
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Wed, Sep 7, 2016 at 8:23 AM, Sean Busbey <busbey@apache.org>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > On Tue, Sep 6, 2016 at 10:36 PM, Josh Elser <josh.elser@gmail.com>
> >> > > > > > > > wrote:
> >> > > > > > > > > So, the answer to Sean's original question is "as robust as snapshots
> >> > > > > > > > > presently are"? (independence of backup/restore failure tolerance from
> >> > > > > > > > > snapshot failure tolerance)
> >> > > > > > > > >
> >> > > > > > > > > Is this just a question WRT context of the change, or is it means for a
> >> > > > > > > > > veto from you, Sean? Just trying to make sure I'm following along
> >> > > > > > > > > adequately.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > > I'd say ATM I'm -0, bordering on -1 but not for reasons I can articulate
> >> > > > > > > > well.
> >> > > > > > > >
> >> > > > > > > > Here's an attempt.
> >> > > > > > > >
> >> > > > > > > > We've been trying to move, as a community, towards minimizing risk to
> >> > > > > > > > downstream folks by getting "complete enough for use" gates in place
> >> > > > > > > > before we introduce new features. This was spurred by some features
> >> > > > > > > > getting in half-baked and never making it to "can really use" status
> >> > > > > > > > (I'm thinking of distributed log replay and the zk-less assignment
> >> > > > > > > > stuff, I don't recall if there was more).
> >> > > > > > > >
> >> > > > > > > > The gates, generally, included things like:
> >> > > > > > > >
> >> > > > > > > > * have docs
> >> > > > > > > > * have sunny-day correctness tests
> >> > > > > > > > * have correctness-in-face-of-failure tests
> >> > > > > > > > * don't rely on things outside of HBase for normal operation (okay for
> >> > > > > > > > advanced operation)
> >> > > > > > > >
> >> > > > > > > > As an example, we kept the MOB work off in a branch and out of master
> >> > > > > > > > until it could pass these criteria. The big exemption we've had to
> >> > > > > > > > this was the hbase-spark integration, where we all agreed it could
> >> > > > > > > > land in master because it was very well isolated (the slide away from
> >> > > > > > > > including docs as a first-class part of building up that integration
> >> > > > > > > > has led me to doubt the wisdom of this decision).
> >> > > > > > > >
> >> > > > > > > > We've also been treating inclusion in "probably will be released to
> >> > > > > > > > downstream" branches as a higher bar, requiring
> >> > > > > > > >
> >> > > > > > > > * don't moderately impact performance when the feature isn't in use
> >> > > > > > > > * don't severely impact performance when the feature is in use
> >> > > > > > > > * either default-to-on or show enough demand to believe a non-trivial
> >> > > > > > > > number of folks will turn the feature on
> >> > > > > > > >
> >> > > > > > > > The above has kept MOB and hbase-spark integration out of branch-1,
> >> > > > > > > > presumably while they've "gotten more stable" in master from the odd
> >> > > > > > > > vendor inclusion.
> >> > > > > > > >
> >> > > > > > > > Are we going to have a 2.0 release before the end of the year? We're
> >> > > > > > > > coming up on 1.5 years since the release of version 1.0; seems like
> >> > > > > > > > it's about time, though I haven't seen any concrete plans this year.
> >> > > > > > > > Presuming we are going to have one by the end of the year, it seems a
> >> > > > > > > > bit close to still be adding in "features that need maturing" on the
> >> > > > > > > > branch.
> >> > > > > > > >
> >> > > > > > > > The lack of a concrete plan for 2.0 keeps me from considering these
> >> > > > > > > > things blockers at the moment. But I know first hand how much trouble
> >> > > > > > > > folks have had with other features that have gone into downstream
> >> > > > > > > > facing releases without robustness checks (i.e. replication), and I'm
> >> > > > > > > > concerned about what we're setting up if 2.0 goes out with this
> >> > > > > > > > feature in its current state.
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > > Best regards,
> >> > > > >
> >> > > > >    - Andy
> >> > > > >
> >> > > > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >> > > > > (via Tom White)
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Best regards,
> >> > >
> >> > >    - Andy
> >> > >
> >> > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >> > > (via Tom White)
> >> > >
> >> >
> >>
>
