hbase-dev mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: [DISCUSSION] Merge Backup / Restore - Branch HBASE-7912
Date Mon, 12 Sep 2016 15:03:49 GMT
Sean:
Do you have any more comments?

Cheers

On Fri, Sep 9, 2016 at 1:42 PM, Vladimir Rodionov <vladrodionov@gmail.com>
wrote:

> Sean,
>
> Backup/Restore can fail for various reasons: network outage (cluster
> wide), various time-outs in the HBase and HDFS layers, M/R failure due to
> "HDFS exceeded quota", user error (manual deletion of data), and so on. It
> is impossible to enumerate all possible types of failure in a distributed
> system, and that is not our goal.
>
> We focus completely on backup system table consistency in the presence of
> any type of failure. That is what I call "tolerance to failures".
>
> On a failure:
>
> BACKUP. All backup system information will be restored to its state prior
> to the backup, and all temporary data in HDFS related to the failed
> session will be deleted.
> RESTORE. We do not care about system data, because restore does not change
> it. Temporary data in HDFS will be cleaned up, and the table will be back
> in the state it was in before the operation started.
>
> This is what a user should expect in case of a failure.
>
> -Vlad
>
>
> On Fri, Sep 9, 2016 at 12:56 PM, Sean Busbey <busbey@apache.org> wrote:
>
> > Failing in a consistent way, with docs that explain the various
> > expected failures would be sufficient.
> >
> > On Fri, Sep 9, 2016 at 12:16 PM, Vladimir Rodionov
> > <vladrodionov@gmail.com> wrote:
> > > Do not worry Sean, the doc is coming today as a preview, and our writer
> > > Frank will be working on putting it into the Apache repo. The timeline
> > > depends on Frank's schedule, but I hope we will get it sooner rather
> > > than later.
> > >
> > > As for failure testing, we are focusing only on a consistent state of
> > > backup system data in the presence of any type of failure. We are not
> > > going to implement anything more "fancy" than that. We allow both backup
> > > and restore to fail. What we do not allow is for system data to be
> > > corrupted. Will that suffice for you? Do you have any other concerns you
> > > want us to address?
> > >
> > > -Vlad
> > >
> > >
> > > On Fri, Sep 9, 2016 at 10:56 AM, Sean Busbey <busbey@apache.org> wrote:
> > >
> > >> "docs will come to Apache soon" does not address my concern around docs
> > >> at all, unless said docs have already made it into the project repo. I
> > >> don't want third party resources for using a major and important
> > >> feature of the project, I want us to provide end users with what they
> > >> need to get the job done.
> > >>
> > >> I see some calls for patience on the failure testing, but the appeal to
> > >> us having done a bad job of requiring proper tests of previous features
> > >> just makes me more concerned about not getting them here. I don't want
> > >> to set yet another bad example that will then be pointed to in the
> > >> future.
> > >>
> > >> On Sep 8, 2016 10:50, "Ted Yu" <yuzhihong@gmail.com> wrote:
> > >>
> > >> > Is there any concern which is not addressed ?
> > >> >
> > >> > Do we need another Vote thread ?
> > >> >
> > >> > Thanks
> > >> >
> > >> > On Thu, Sep 8, 2016 at 9:21 AM, Andrew Purtell <apurtell@apache.org>
> > >> > wrote:
> > >> >
> > >> > > Vlad,
> > >> > >
> > >> > > I apologize for using the term 'half-baked' in a way that could seem
> > >> > > a description of HBASE-7912. I meant that as a general hypothetical.
> > >> > >
> > >> > > On Wed, Sep 7, 2016 at 9:36 AM, Vladimir Rodionov
> > >> > > <vladrodionov@gmail.com> wrote:
> > >> > >
> > >> > > > >> I'm not sure that "There is already lots of half-baked code in
> > >> > > > >> the branch, so what's the harm in adding more?"
> > >> > > >
> > >> > > > I meant not production-ready yet. This is the 2.0 development
> > >> > > > branch and, hence, many features are in the works, not well
> > >> > > > tested, etc. I do not consider backup a half-baked feature: it
> > >> > > > has passed our internal QA and has very good docs, which we will
> > >> > > > provide to Apache shortly.
> > >> > > >
> > >> > > > -Vlad
> > >> > > >
> > >> > > > On Wed, Sep 7, 2016 at 9:13 AM, Andrew Purtell
> > >> > > > <apurtell@apache.org> wrote:
> > >> > > >
> > >> > > > > We shouldn't admit half baked changes that won't be finished.
> > >> > > > > However in this case the crew working on this feature are long
> > >> > > > > timers and less likely than just about anyone to leave
> > >> > > > > something in a half baked state. Of course there is no
> > >> > > > > guarantee how anything will turn out, but I am willing to take
> > >> > > > > a little on faith if they feel their best path forward now is
> > >> > > > > to merge to trunk. I only wish I had bandwidth to have done
> > >> > > > > some real kicking of the tires by now. Maybe this week.
> > >> > > > >
> > >> > > > > (Yes, I'm using some of that time for this email :-) but I type
> > >> > > > > fast.)
> > >> > > > >
> > >> > > > > That said, I would like to agitate for making 2.0 more real and
> > >> > > > > spend some time on it now that I'm winding down with 0.98. I
> > >> > > > > think that means branching for 2.0 real soon now and even
> > >> > > > > evicting things from 2.0 branch that aren't finished or stable,
> > >> > > > > leaving them only once again in the master branch. Or, maybe
> > >> > > > > just evicting them. Let's take it case by case.
> > >> > > > >
> > >> > > > > I think this feature can come in relatively safely. As added
> > >> > > > > insurance, let's admit the possibility it could be reverted on
> > >> > > > > the 2.0 branch if folks working on stabilizing 2.0 decide to
> > >> > > > > evict it because it is unfinished or unstable, because that
> > >> > > > > certainly can happen. I would expect if talk like that starts,
> > >> > > > > we'd get help finishing or stabilizing what's under discussion
> > >> > > > > for revert. Or, we'd have a revert. Either way the outcome is
> > >> > > > > acceptable.
> > >> > > > >
> > >> > > > >
> > >> > > > > On Wed, Sep 7, 2016 at 8:56 AM, Dima Spivak
> > >> > > > > <dimaspivak@apache.org> wrote:
> > >> > > > >
> > >> > > > > > I'm not sure that "There is already lots of half-baked code
> > >> > > > > > in the branch, so what's the harm in adding more?" is a good
> > >> > > > > > code commit philosophy for a fault-tolerant distributed data
> > >> > > > > > store. ;)
> > >> > > > > >
> > >> > > > > > More seriously, a lack of test coverage for existing features
> > >> > > > > > shouldn't be used as justification for introducing new
> > >> > > > > > features with the same shortcomings. Ultimately, it's the end
> > >> > > > > > user who will feel the pain, so shouldn't we do everything we
> > >> > > > > > can to mitigate that?
> > >> > > > > >
> > >> > > > > > -Dima
> > >> > > > > >
> > >> > > > > > On Wed, Sep 7, 2016 at 8:46 AM, Vladimir Rodionov
> > >> > > > > > <vladrodionov@gmail.com> wrote:
> > >> > > > > >
> > >> > > > > > > Sean,
> > >> > > > > > >
> > >> > > > > > > * have docs
> > >> > > > > > >
> > >> > > > > > > Agree. We have a doc, and backup is the most documented
> > >> > > > > > > feature :). We will release it to Apache shortly.
> > >> > > > > > >
> > >> > > > > > > * have sunny-day correctness tests
> > >> > > > > > >
> > >> > > > > > > The feature has close to 60 test cases, which run for
> > >> > > > > > > approx. 30 min. We can add more, if the community does not
> > >> > > > > > > mind :)
> > >> > > > > > >
> > >> > > > > > > * have correctness-in-face-of-failure tests
> > >> > > > > > >
> > >> > > > > > > Any examples of these tests in existing features? This is
> > >> > > > > > > in the works; we have a clear understanding of what should
> > >> > > > > > > be done by the time of the 2.0 release. Verifying with the
> > >> > > > > > > IT monkey for the existing code is a very near-term goal
> > >> > > > > > > for us.
> > >> > > > > > >
> > >> > > > > > > * don't rely on things outside of HBase for normal
> > >> > > > > > > operation (okay for advanced operation)
> > >> > > > > > >
> > >> > > > > > > We do not.
> > >> > > > > > >
> > >> > > > > > > An enormous amount of time has already been spent on
> > >> > > > > > > developing and testing the feature; it has passed our
> > >> > > > > > > internal tests and many rounds of code review by HBase
> > >> > > > > > > committers. We do not mind if someone from the HBase
> > >> > > > > > > community (outside of HW) reviews the code, but waiting for
> > >> > > > > > > a volunteer will probably take forever, since the feature
> > >> > > > > > > is quite large (1MB+ cumulative patch).
> > >> > > > > > >
> > >> > > > > > > The 2.0 branch is full of half-baked features, and most of
> > >> > > > > > > them are in active development, so I am not following you
> > >> > > > > > > here, Sean. Why is HBASE-7912 not yet good enough to be
> > >> > > > > > > integrated into the 2.0 branch?
> > >> > > > > > >
> > >> > > > > > > -Vlad
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Wed, Sep 7, 2016 at 8:23 AM, Sean Busbey
> > >> > > > > > > <busbey@apache.org> wrote:
> > >> > > > > > >
> > >> > > > > > > > On Tue, Sep 6, 2016 at 10:36 PM, Josh Elser
> > >> > > > > > > > <josh.elser@gmail.com> wrote:
> > >> > > > > > > > > So, the answer to Sean's original question is "as
> > >> > > > > > > > > robust as snapshots presently are"? (independence of
> > >> > > > > > > > > backup/restore failure tolerance from snapshot failure
> > >> > > > > > > > > tolerance)
> > >> > > > > > > > >
> > >> > > > > > > > > Is this just a question WRT context of the change, or
> > >> > > > > > > > > is it grounds for a veto from you, Sean? Just trying to
> > >> > > > > > > > > make sure I'm following along adequately.
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > I'd say ATM I'm -0, bordering on -1 but not for reasons
> > >> > > > > > > > I can articulate well.
> > >> > > > > > > >
> > >> > > > > > > > Here's an attempt.
> > >> > > > > > > >
> > >> > > > > > > > We've been trying to move, as a community, towards
> > >> > > > > > > > minimizing risk to downstream folks by getting "complete
> > >> > > > > > > > enough for use" gates in place before we introduce new
> > >> > > > > > > > features. This was spurred by some features getting in
> > >> > > > > > > > half-baked and never making it to "can really use"
> > >> > > > > > > > status (I'm thinking of distributed log replay and the
> > >> > > > > > > > zk-less assignment stuff, I don't recall if there was
> > >> > > > > > > > more).
> > >> > > > > > > >
> > >> > > > > > > > The gates, generally, included things like:
> > >> > > > > > > >
> > >> > > > > > > > * have docs
> > >> > > > > > > > * have sunny-day correctness tests
> > >> > > > > > > > * have correctness-in-face-of-failure tests
> > >> > > > > > > > * don't rely on things outside of HBase for normal
> > >> > > > > > > > operation (okay for advanced operation)
> > >> > > > > > > >
> > >> > > > > > > > As an example, we kept the MOB work off in a branch and
> > >> > > > > > > > out of master until it could pass these criteria. The
> > >> > > > > > > > big exemption we've had to this was the hbase-spark
> > >> > > > > > > > integration, where we all agreed it could land in master
> > >> > > > > > > > because it was very well isolated (the slide away from
> > >> > > > > > > > including docs as a first-class part of building up that
> > >> > > > > > > > integration has led me to doubt the wisdom of this
> > >> > > > > > > > decision).
> > >> > > > > > > >
> > >> > > > > > > > We've also been treating inclusion in "probably will be
> > >> > > > > > > > released to downstream" branches as a higher bar,
> > >> > > > > > > > requiring
> > >> > > > > > > >
> > >> > > > > > > > * don't moderately impact performance when the feature
> > >> > > > > > > > isn't in use
> > >> > > > > > > > * don't severely impact performance when the feature is
> > >> > > > > > > > in use
> > >> > > > > > > > * either default-to-on or show enough demand to believe
> > >> > > > > > > > a non-trivial number of folks will turn the feature on
> > >> > > > > > > >
> > >> > > > > > > > The above has kept MOB and hbase-spark integration out
> > >> > > > > > > > of branch-1, presumably while they've "gotten more
> > >> > > > > > > > stable" in master from the odd vendor inclusion.
> > >> > > > > > > >
> > >> > > > > > > > Are we going to have a 2.0 release before the end of the
> > >> > > > > > > > year? We're coming up on 1.5 years since the release of
> > >> > > > > > > > version 1.0; seems like it's about time, though I
> > >> > > > > > > > haven't seen any concrete plans this year. Presuming we
> > >> > > > > > > > are going to have one by the end of the year, it seems a
> > >> > > > > > > > bit close to still be adding in "features that need
> > >> > > > > > > > maturing" on the branch.
> > >> > > > > > > >
> > >> > > > > > > > The lack of a concrete plan for 2.0 keeps me from
> > >> > > > > > > > considering these things blockers at the moment. But I
> > >> > > > > > > > know first hand how much trouble folks have had with
> > >> > > > > > > > other features that have gone into downstream facing
> > >> > > > > > > > releases without robustness checks (i.e. replication),
> > >> > > > > > > > and I'm concerned about what we're setting up if 2.0
> > >> > > > > > > > goes out with this feature in its current state.
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > --
> > >> > > > > Best regards,
> > >> > > > >
> > >> > > > >    - Andy
> > >> > > > >
> > >> > > > > Problems worthy of attack prove their worth by hitting back. -
> > >> > > > > Piet Hein (via Tom White)
> > >> > > > >
> > >> > > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Best regards,
> > >> > >
> > >> > >    - Andy
> > >> > >
> > >> > > Problems worthy of attack prove their worth by hitting back. -
> > >> > > Piet Hein (via Tom White)
> > >> > >
> > >> >
> > >>
> >
>
