hbase-dev mailing list archives

From Stack <st...@duboce.net>
Subject Re: [DISCUSS] Plan for Distributed testing of Backup and Restore
Date Tue, 12 Sep 2017 16:36:45 GMT
On Tue, Sep 12, 2017 at 9:33 AM, Andrew Purtell <andrew.purtell@gmail.com>
wrote:

> I think those are reasonable criteria, Josh.
>
> What I would like to see is something like "we ran ITBLL (or custom
> generator with similar correctness validation if you prefer) on a dev
> cluster (5-10 nodes) for 24 hours with server killing chaos agents active,
> attempted 1,440 backups (one per minute), of which 1,000 succeeded and 100%
> of these were successfully restored and validated." This implies your
> points on automation and no manual intervention. Maybe the number of
> successful backups under challenging conditions will be lower. Point is
> they demonstrate we can rely on it even when a cluster is partially
> unhealthy, which in production is often the normal order of affairs.
>
>
Sounds good to me.

How will you test the restore aspect? After 1k (or whatever makes sense)
incremental backups over the life of the chaos run, could you restore and
validate that the table has all expected data in place?
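
For concreteness, a rough sketch of what such a driver could look like.
Everything here is illustrative: the three helper methods are placeholders for
the real backup, restore, and ITBLL-verify calls, not existing HBase APIs, and
the chaos agents are assumed to be running against the cluster separately.

// Illustrative driver: attempt a backup every minute for 24 hours while a
// server-killing ChaosMonkey is assumed to be running elsewhere, then restore
// the last good backup and verify it. All three helpers are placeholders.
public class BackupChaosDriver {

  public static void main(String[] args) throws Exception {
    final long endTime = System.currentTimeMillis() + 24L * 60 * 60 * 1000;
    int attempted = 0;
    int succeeded = 0;

    while (System.currentTimeMillis() < endTime) {
      attempted++;
      try {
        attemptIncrementalBackup();   // placeholder for the real backup call
        succeeded++;
      } catch (Exception e) {
        // Failures are expected while servers are being killed; record and move on.
        System.err.println("backup attempt " + attempted + " failed: " + e);
      }
      Thread.sleep(60_000L);          // one attempt per minute
    }
    System.out.printf("attempted=%d succeeded=%d%n", attempted, succeeded);

    // Final correctness check: restore the last successful backup into a
    // fresh table and run the ITBLL Verify step against it.
    restoreLatestBackup();            // placeholder
    verifyRestoredTableWithItbll();   // placeholder
  }

  private static void attemptIncrementalBackup() throws Exception { /* ... */ }
  private static void restoreLatestBackup() throws Exception { /* ... */ }
  private static void verifyRestoredTableWithItbll() throws Exception { /* ... */ }
}

The numbers Andrew describes (attempted vs. succeeded backups, plus the final
restore/verify result) fall straight out of the counters above.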

Thanks,
St.Ack



>
> > On Sep 12, 2017, at 9:07 AM, Josh Elser <elserj@apache.org> wrote:
> >
> >> On 9/11/17 11:52 PM, Stack wrote:
> >> On Mon, Sep 11, 2017 at 11:07 AM, Vladimir Rodionov <vladrodionov@gmail.com>
> >> wrote:
> >>> ...
> >>> That is mostly it. Yes, we have not done real testing with real data
> >>> on a real cluster yet, except QA testing on a small OpenStack
> >>> cluster (10 nodes). That is probably our biggest minus right now. I
> >>> would like to inform the community that this week we are going to start
> >>> full-scale testing with reasonably sized data sets.
> >>>
> >> ... Completion of HA seems important as is result of the scale testing.
> >
> > I think we should knock out a rough sketch of what effective "scale"
> > testing would look like, since that is a very subjective phrase. Let me
> > start the ball rolling with a few things that come to my mind.
> >
> > (interpreting requirements as per rfc2119)
> >
> > * MUST have >5 RegionServers and >1 Masters in play
> > * MUST have non-trivial final data sizes (final data size would be >=
> >   100's of GB)
> > * MUST have some clear pass/fail determination for correctness of B&R
> > * MUST have some fault-injection
> >
> > * SHOULD be a completely automated test, not requiring a human to
> >   coordinate or execute commands
> > * SHOULD be able to acquire operational insight (metrics) while
> >   performing operations to determine success of testing
> > * SHOULD NOT require manual intervention, e.g. working around known
> >   issues/limitations
> > * SHOULD reuse the IntegrationTest framework in hbase-it
> >
> > Since we have a concern about correctness, ITBLL sounds like a good
> > starting point to avoid having to re-write similar kinds of logic.
> > ChaosMonkey is always great for fault-injection.
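
As a concrete starting point for the hbase-it reuse, something along these
lines might work. Treat it as a sketch only: the IntegrationTestBase method
names are recalled from the hbase-it framework and may not match exactly, and
doBackupRestoreCycle() is a placeholder for the actual ITBLL-load + periodic
backup + restore + verify logic.

import java.util.Collections;
import java.util.Set;

import org.apache.hadoop.hbase.IntegrationTestBase;
import org.apache.hadoop.hbase.TableName;

// Sketch of an hbase-it style test: ChaosMonkey supplies the fault injection,
// and doBackupRestoreCycle() stands in for the data load, periodic backups,
// restore, and pass/fail verification.
public class IntegrationTestBackupRestoreWithChaos extends IntegrationTestBase {

  @Override
  public void setUpCluster() throws Exception {
    util = getTestingUtil(getConf());
    util.initializeCluster(5);   // in a real run: >5 RegionServers, >1 Master
  }

  @Override
  public int runTestFromCommandLine() throws Exception {
    setUpMonkey();               // fault injection, e.g. a server-killing policy
    try {
      return doBackupRestoreCycle() ? 0 : 1;
    } finally {
      cleanUpMonkey();
    }
  }

  // Placeholder: generate ITBLL data, take backups on a schedule, restore the
  // last good backup, and verify the restored table holds all expected data.
  private boolean doBackupRestoreCycle() {
    return true;
  }

  @Override
  public TableName getTablename() {
    return TableName.valueOf("IntegrationTestBackupRestoreWithChaos");
  }

  @Override
  protected Set<String> getColumnFamilies() {
    return Collections.singleton("f");
  }
}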
> >
> > Thoughts?
>
