hbase-dev mailing list archives

From Josh Elser <els...@apache.org>
Subject Re: [DISCUSS] Plan for Distributed testing of Backup and Restore
Date Tue, 12 Sep 2017 19:29:10 GMT


On 9/12/17 2:51 PM, Andrew Purtell wrote:
>> making backup work in challenging conditions was not a goal of the FT
>> design; correct failure handling was.
> 
> Every real-world production environment has challenging conditions.
> 
> That said, making progress in the face of failures is only one aspect of
> FT, and an equally valid one is that failures do not cause data corruption.
> 
> If testing with chaos proves this backup solution will fail if there is any
> failure while backup is in progress, but at least it will successfully
> clean up and not corrupt existing state - that could be ok, for some.
> Possibly, us.

Agreed. There are always differences of opinion around acceptable levels 
of tolerance. Understanding how things fail (avoiding the need for 
manual intervention to correct them) is a good initial goal-post, as we 
can concisely document that for users. My impression is that this 
wouldn't require a significant amount of work to achieve an acceptable 
degree of stability.

> If testing with chaos proves this backup solution will not suffer
> corruption if there is a failure *and* can still successfully complete if
> there is any failure while backup is in progress - that would obviously
> improve the perceived value proposition.
> 
> It would be fine to test this using hbase-it chaos facilities but with a
> less aggressive policy than slowDeterministic that allows for backups to
> successfully complete once in a while yet also demonstrate that when the
> failures do happen things are properly cleaned up and data corruption does
> not happen.
> 
> On Tue, Sep 12, 2017 at 11:25 AM, Vladimir Rodionov <vladrodionov@gmail.com>
> wrote:
> 
>>> Vlad: I'm obviously curious to see what you think about this stuff, in
>>> addition to what you already had in mind :)
>>
>> Yes, I think that we need a test tool similar to ITBLL. Btw, making backup
>> work in challenging conditions was not a goal of the FT design; correct
>> failure handling was.

Based on Ted's mention of ITBackupRestore (thanks btw, Ted!), I think 
that gets into the details a little too much for this thread. We'd 
definitely need to improve on that test for what we're discussing here, 
but perhaps it's a nice starting point?

>> On Tue, Sep 12, 2017 at 9:53 AM, Josh Elser <elserj@apache.org> wrote:
>>
>>> Thanks for the quick feedback!
>>>
>>> On 9/12/17 12:36 PM, Stack wrote:
>>>
>>>> On Tue, Sep 12, 2017 at 9:33 AM, Andrew Purtell <andrew.purtell@gmail.com>
>>>> wrote:
>>>>
>>>>> I think those are reasonable criteria, Josh.
>>>>>
>>>>> What I would like to see is something like "we ran ITBLL (or custom
>>>>> generator with similar correctness validation if you prefer) on a dev
>>>>> cluster (5-10 nodes) for 24 hours with server killing chaos agents
>>>>> active,
>>>>> attempted 1,440 backups (one per minute), of which 1,000 succeeded and
>>>>> 100% of these were successfully restored and validated." This implies your
>>>>> points on automation and no manual intervention. Maybe the number of
>>>>> successful backups under challenging conditions will be lower. Point is
>>>>> they demonstrate we can rely on it even when a cluster is partially
>>>>> unhealthy, which in production is often the normal order of affairs.
>>>
>>> I like it. I hadn't thought about stressing quite this aggressively, but
>>> now that I think about it, it sounds like a great plan. Having some ballpark
>>> measure to quantify the cost of a "backup-heavy" workload would be cool in
>>> addition to seeing how the system reacts under unexpected conditions.
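[Editor's note: the 24-hour run Andrew sketches above could be driven by a small harness along these lines. This is a hypothetical sketch, not anything from the thread: the `hbase backup create` arguments, `BACKUP_ROOT`, and `TABLE` are assumptions and vary by HBase version; the success tally is the number you would report.]

```python
import subprocess
import time

# Assumed values -- not from the thread; adjust for your cluster/version.
BACKUP_ROOT = "hdfs:///backup"
TABLE = "IntegrationTestBigLinkedList"

def attempt_backup(backup_type="full"):
    """Attempt one backup via the hbase CLI; True on success.
    The exact CLI shape is an assumption -- check your HBase version."""
    result = subprocess.run(
        ["hbase", "backup", "create", backup_type, BACKUP_ROOT, "-t", TABLE],
        capture_output=True,
    )
    return result.returncode == 0

def run_campaign(duration_s, interval_s, do_backup=attempt_backup,
                 clock=time.monotonic, sleep=time.sleep):
    """Attempt a backup every interval_s seconds for duration_s seconds,
    tallying attempts and successes. Failed attempts are expected to be
    cleaned up by the backup tooling itself, not by this loop."""
    attempted = succeeded = 0
    deadline = clock() + duration_s
    while clock() < deadline:
        attempted += 1
        if do_backup():
            succeeded += 1
        sleep(interval_s)
    return attempted, succeeded
```

With `duration_s=24*3600` and `interval_s=60` this yields roughly the 1,440 attempts Andrew mentions; `clock` and `sleep` are injectable so the scheduling logic itself is testable without a cluster.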
>>>
>>>> Sounds good to me.
>>>>
>>>> How will you test the restore aspect? After 1k (or whatever makes sense)
>>>> incremental backups over the life of the chaos, could you restore and
>>>> validate that the table had all expected data in place?
>>>>
>>>
>>> Exactly. My thinking was that, at any point, we should be able to do a
>>> restore and validate. Maybe something like: every Nth ITBLL iteration, make
>>> a new backup point, restore a previous backup point, verify, then restore to
>>> the newest backup point. The previous backup point could be either a full or
>>> an incremental point.
>>>
>>> Vlad: I'm obviously curious to see what you think about this stuff, in
>>> addition to what you already had in mind :)
>>>
>>
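[Editor's note: the cadence Josh describes (every Nth ITBLL iteration: new backup point, restore a previous point, verify, roll forward) could be sketched as below. All names here are hypothetical stand-ins; `ops` would be wired to the real backup CLI and an ITBLL-style verify step by the actual harness.]

```python
def backup_restore_check(iteration, every_n, backups, ops):
    """Every Nth iteration: take a new backup point, restore a previous
    point, verify it, then restore back to the newest point.

    `ops` maps "create"/"restore"/"verify" to callables supplied by the
    real harness (these names are illustrative, not a real API);
    `backups` is the list of backup ids taken so far, mutated in place.
    Returns True if this iteration performed any backup work."""
    if iteration % every_n != 0:
        return False
    backups.append(ops["create"]())       # new backup point
    if len(backups) >= 2:
        previous = backups[-2]            # a full or incremental point
        ops["restore"](previous)
        ops["verify"](previous)           # e.g. ITBLL Verify against it
        ops["restore"](backups[-1])       # roll forward to the newest point
    return True
```

Keeping the scheduling separate from the create/restore/verify operations lets the cadence be unit-tested with fakes before any cluster time is spent.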
