hbase-dev mailing list archives

From Vladimir Rodionov <vladrodio...@gmail.com>
Subject Re: [DISCUSS] Plan for Distributed testing of Backup and Restore
Date Tue, 12 Sep 2017 18:25:00 GMT
>> Vlad: I'm obviously curious to see what you think about this stuff, in
>> addition to what you already had in mind :)

Yes, I think that we need a test tool similar to ITBLL. Btw, making backup
work under challenging conditions was not a goal of the FT design; correct
failure handling was the goal.
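
To make the two goals concrete, here is a minimal sketch of how a test run could be scored against both the criteria quoted below and the failure-handling point above. It is not part of any HBase tool: `BackupAttempt`, `summarize`, and the `failed_cleanly` flag are hypothetical names, and "failed cleanly" is only an assumed reading of "correct failure handling" (the failed backup left nothing behind that breaks later backups or restores).

```python
from dataclasses import dataclass


@dataclass
class BackupAttempt:
    """One backup attempt recorded by a hypothetical test driver."""
    succeeded: bool
    failed_cleanly: bool = False          # only meaningful when succeeded is False
    restored_and_validated: bool = False  # only meaningful when succeeded is True


def summarize(attempts):
    """Score a run: count successes, then check both acceptance conditions."""
    succeeded = [a for a in attempts if a.succeeded]
    failed = [a for a in attempts if not a.succeeded]
    return {
        "attempted": len(attempts),
        "succeeded": len(succeeded),
        # Criterion from the thread: every successful backup restores and validates.
        "all_successes_restorable": all(a.restored_and_validated for a in succeeded),
        # Assumed reading of "correct failure handling": every failure is clean.
        "all_failures_clean": all(a.failed_cleanly for a in failed),
    }


# One attempt per minute for 24 hours, as in the quoted example.
ATTEMPTS_PER_DAY = 24 * 60
```

Under this scoring, a run where only 1,000 of 1,440 attempts succeed still passes, provided both boolean checks hold.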

On Tue, Sep 12, 2017 at 9:53 AM, Josh Elser <elserj@apache.org> wrote:

> Thanks for the quick feedback!
>
> On 9/12/17 12:36 PM, Stack wrote:
>
>> On Tue, Sep 12, 2017 at 9:33 AM, Andrew Purtell <andrew.purtell@gmail.com> wrote:
>>
>>> I think those are reasonable criteria Josh.
>>>
>>> What I would like to see is something like "we ran ITBLL (or custom
>>> generator with similar correctness validation if you prefer) on a dev
>>> cluster (5-10 nodes) for 24 hours with server-killing chaos agents
>>> active, attempted 1,440 backups (one per minute), of which 1,000
>>> succeeded and 100% of these were successfully restored and validated."
>>> This implies your points on automation and no manual intervention.
>>> Maybe the number of successful backups under challenging conditions
>>> will be lower. Point is they demonstrate we can rely on it even when a
>>> cluster is partially unhealthy, which in production is often the
>>> normal order of affairs.
>>>
>>>
>>>
> I like it. I hadn't thought about stressing quite this aggressively, but
> now that I think about it, sounds like a great plan. Having some ballpark
> measure to quantify the cost of a "backup-heavy" workload would be cool in
> addition to seeing how the system reacts in unexpected manners.
>
> Sounds good to me.
>>
>> How will you test the restore aspect? After 1k (or whatever makes sense)
>> incremental backups over the life of the chaos, could you restore and
>> validate that the table had all expected data in place?
>>
>
> Exactly. My thinking was that, at any point, we should be able to do a
> restore and validate. Maybe something like: every Nth ITBLL iteration, make
> a new backup point, restore a previous backup point, verify, restore to
> newest backup point. The previous backup point should be a full or
> incremental point.
>
> Vlad: I'm obviously curious to see what you think about this stuff, in
> addition to what you already had in mind :)
>
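
The backup/restore cycle Josh describes (every Nth iteration: take a new backup point, restore a previous one, verify, restore to the newest) can be sketched as a small loop. This is an illustration only, run against an in-memory stand-in rather than a real cluster: `FakeCluster`, `expected_rows`, and `run` are hypothetical names, the fake stores a full copy of the table for every backup point (including the ones labeled incremental), and no chaos agents are modeled.

```python
import copy


class FakeCluster:
    """In-memory stand-in for a table plus its backup store (hypothetical)."""

    def __init__(self):
        self.table = {}
        self.backups = []  # list of (kind, table copy); every entry is a full copy

    def write_iteration(self, i):
        # Stand-in for one data-generator pass: add three rows for iteration i.
        for j in range(3):
            self.table[f"row-{i}-{j}"] = i

    def create_backup(self, kind):
        self.backups.append((kind, copy.deepcopy(self.table)))
        return len(self.backups) - 1

    def restore(self, backup_id):
        self.table = copy.deepcopy(self.backups[backup_id][1])


def expected_rows(iterations_done):
    """Rows the table should hold after the given number of iterations."""
    return {f"row-{i}-{j}": i for i in range(iterations_done) for j in range(3)}


def run(total_iterations, n):
    """Every n-th iteration: back up, restore previous point, verify, restore newest."""
    cluster = FakeCluster()
    iterations_at_backup = []  # iterations completed when each backup was taken
    verified = 0
    for i in range(total_iterations):
        cluster.write_iteration(i)
        if (i + 1) % n != 0:
            continue
        kind = "full" if not cluster.backups else "incremental"
        newest = cluster.create_backup(kind)
        iterations_at_backup.append(i + 1)
        if newest > 0:
            previous = newest - 1
            cluster.restore(previous)
            assert cluster.table == expected_rows(iterations_at_backup[previous])
            verified += 1
            # Return to the newest point so the workload continues from current data.
            cluster.restore(newest)
            assert cluster.table == expected_rows(iterations_at_backup[newest])
    return verified, len(cluster.backups)
```

A real harness would replace the dictionary comparison with the generator's own verification step and would pick the previous point at random, so that both full and incremental points get exercised.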
