hbase-dev mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: [DISCUSS] Plan to avoid backup/restore removal from 2.0
Date Wed, 08 Nov 2017 18:26:40 GMT
I won't speak to the timing aspects of this; that's up to the RM. But the
testing details look reasonable to me. With respect to chaos testing, the
following goals would be good:

- Some backups and restores succeed even with masters and RSes going up and
down. The resiliency can always be improved later, but we can't rely on no
failures for the entire duration of a backup or restore operation to get a
good result, especially for restore.

- Backups are not corrupted by failures. Or, corrupted (partial?) backups
are identified and ignored and there are still good backups remaining which
can be used for restore.

- When the verification tool says a backup and restore are good, they
really are.
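
A minimal sketch of the last two goals, assuming a backup can be reduced to
a set of row key/value pairs and that a digest is recorded when the backup
is taken. All names here (digest, newest_good_backup, restore_is_good) are
hypothetical and illustrative only; none of this is HBase API or the actual
verification tool.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for table contents: row key -> value.
Rows = Dict[bytes, bytes]


def digest(rows: Rows) -> str:
    """Digest over all rows, visiting keys in sorted order."""
    h = hashlib.sha256()
    for key in sorted(rows):
        value = rows[key]
        # Length-prefix each field so (b"ab", b"c") and (b"a", b"bc") differ.
        h.update(len(key).to_bytes(4, "big") + key)
        h.update(len(value).to_bytes(4, "big") + value)
    return h.hexdigest()


def newest_good_backup(backups: List[Tuple[str, Rows, str]]) -> Optional[str]:
    """Pick the newest backup whose rows still match its recorded digest.

    `backups` holds (backup_id, rows, digest_recorded_at_backup_time),
    ordered oldest first. Corrupted or partial backups are skipped.
    """
    for backup_id, rows, recorded in reversed(backups):
        if digest(rows) == recorded:
            return backup_id
    return None


def restore_is_good(source: Rows, restored: Rows) -> bool:
    """A restore is good only if it reproduces the source rows exactly."""
    return digest(source) == digest(restored)
```

With a complete backup followed by a partial one, newest_good_backup skips
the partial backup and returns the earlier one; a chaos test could assert
the same after killing masters and RSes mid-backup.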

On Tue, Nov 7, 2017 at 8:30 PM, Josh Elser <elserj@apache.org> wrote:

> Folks,
> I've been working with Vlad and Ted offline to make sure we have a plan
> that addresses the implementation gaps Vlad sees and the barriers-for-entry
> previously stated to keep the feature in HBase 2.0. My hope is that this
> can be an honest discussion given 2.0-beta timelines, with a concrete
> action plan. I'm trying my best to not re-hash the logic/reasoning/caveats
> behind previous concerns; anything folks feel is a blocker that I haven't
> covered below is unintentional.
> The list:
> 1. Documentation. It must be updated and committed, ensuring it covers the
> details operators/architects need to know to use it effectively
> (HBASE-16574). Vlad will help with content; Frank and/or I will get it
> converted to asciidoc.
> 2. Distributed testing missing. Vlad has taken my previous document on
> goals and translated that into an implementation outline[1]. Ted and I have
> already weighed in -- I believe it hits the salient points for the quality
> of testing we're looking for. I'll get started on this while Vlad does #4
> (after consensus on approach, of course). Needs JIRA issue (maybe?).
> 3. Operator utility to verify backups. In the abstract, this should just be
> the same guts as a tool like VerifyReplication. In practice, this should be
> the same code that #2 uses (if not _actually_ the same guts as
> VerifyReplication). The hope is that this will be encapsulated (time-wise)
> by #2. Needs JIRA issue (maybe?).
> 4. Polish DistCP for bulk-loaded files/fault-tolerance (HBASE-17852). I
> don't have specifics here -- will rely on Vlad to correct me if there's a
> better JIRA issue to track than the aforementioned. Will rely on details
> showing up on the JIRA issue to track it.
> Current due dates:
> 1. End of week (2017/11/10)
> 2. Before US Thanksgiving (2017/11/22)
> 3. Same as #2
> 4. Same as #1
> My current thought is that this is reasonable for implementation times,
> and would not derail the rest of the beta-1 train. I appreciate the
> patience from all parties, and I hope that those trying to make this better
> can find a little more time to give some feedback. Thanks for the long read
> if nothing else.
> - Josh
> [1] https://docs.google.com/document/d/1xbPlLKjOcPq2LDqjbSkF6uNDAG0mzgOxek6P3POLeMc/edit?usp=sharing

Best regards,

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk
