hbase-issues mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-17852) Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
Date Thu, 30 Nov 2017 20:39:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273376#comment-16273376 ]

Josh Elser commented on HBASE-17852:
------------------------------------

bq. Operators who'd rather avoid reading logs and having to run repair tools are 'lazy'.

bq. Do not we still have hbck for this reason? Repair \[...\] which happens periodically in
HBase cluster.

Let me also expand on this: I would consider "lazy" a virtue for operators. The system
should automatically handle as much as possible. There's a fundamental difference between
what hbck is and what {{hbase backup repair}} is: HBCK fixes things that inadvertently happen
server-side (hopefully, only because of bugs which have since been fixed), whereas hbase-backup
operations are completely client-driven. For example, something as benign as a user ctrl-C'ing a backup
because they mis-typed the backup name or the table being backed up would cause the backup table
to need a repair.

bq. This is the question actually, should we do repair automatically or we need to inform
user, that there was abnormal failure of a last backup/merge/delete command and user need
to run repair.

I was about to write that I thought it was a no-brainer to blindly run a repair as a part
of the BackupDriver, but now I wonder about the following:

Take two administrators running backups, unaware of each other. Admin1 starts a backup on
Table1. Before Admin1's backup finishes, Admin2 tries to do a backup on Table2. Could Admin2
preempt/fail Admin1's backup by running a {{hbase backup repair}} while Admin1 is using the
system?

In other words: does {{hbase backup repair}} have the ability to differentiate between "user
is currently executing a backup" and "stale state exists in the table from an aborted/unfinished
operation"?
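
To make the question concrete, here is a minimal sketch of one way such a distinction could be drawn. It assumes a hypothetical lease record with a heartbeat timestamp stored alongside the backup state. The names {{BackupLease}}, {{classify_state}}, {{may_repair}} and the 60-second timeout are illustrative assumptions, not anything taken from the patch or from an existing HBase API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lease record; this is NOT an actual HBase backup structure.
@dataclass
class BackupLease:
    owner: str             # who started the operation, e.g. "admin1"
    last_heartbeat: float  # seconds since the epoch, refreshed by a live client

# Assumed value, chosen only for illustration.
LEASE_TIMEOUT_SECONDS = 60.0

def classify_state(lease: Optional[BackupLease], now: float) -> str:
    """Classify the backup table as 'clean', 'active', or 'stale'."""
    if lease is None:
        return "clean"   # no operation recorded at all
    if now - lease.last_heartbeat <= LEASE_TIMEOUT_SECONDS:
        return "active"  # a client is (probably) still running
    return "stale"       # the owner stopped heartbeating, e.g. after ctrl-C

def may_repair(lease: Optional[BackupLease], now: float) -> bool:
    """An automatic repair would only be safe when the state is stale."""
    return classify_state(lease, now) == "stale"
```

Under this assumption, Admin2's automatic repair would be refused while Admin1's client keeps refreshing its heartbeat, and allowed only once the lease has expired.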

> Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-17852
>                 URL: https://issues.apache.org/jira/browse/HBASE-17852
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Vladimir Rodionov
>            Assignee: Vladimir Rodionov
>             Fix For: 2.0.0
>
>         Attachments: HBASE-17852-v1.patch, HBASE-17852-v2.patch, HBASE-17852-v3.patch,
HBASE-17852-v4.patch, HBASE-17852-v5.patch, HBASE-17852-v6.patch, HBASE-17852-v7.patch, HBASE-17852-v8.patch,
HBASE-17852-v9.patch
>
>
> Design approach rollback-via-snapshot implemented in this ticket:
> # Before a backup create/delete/merge starts, we take a snapshot of the backup meta-table
(backup system table). This procedure is lightweight because the meta table is small and usually
fits in a single region.
> # When an operation fails on the server side, we handle the failure by cleaning up partial
data in the backup destination, then restoring the backup meta-table from the snapshot.
> # When an operation fails on the client side (abnormal termination, for example), the next time
the user tries create/merge/delete, they will see an error message saying that the system is in an
inconsistent state and that repair is required; they will then need to run the backup repair tool.
> # To avoid multiple writers to the backup system table (the backup client and BackupObservers),
we introduce a small table used ONLY to keep the listing of bulk loaded files. All backup observers will
work only with this new table. The reason: if a failure occurs during backup create/delete/merge/restore
and the system performs an automatic rollback, some data written by backup observers during the failed
operation may be lost. This is what we try to avoid.
> # The second table keeps only bulk-load-related references. We do not care about the consistency
of this table, because bulk load is an idempotent operation and can be repeated after a failure.
Partially written data in the second table does not affect the BackupHFileCleaner plugin, because
this data (the list of bulk loaded files) corresponds to files which have not yet been loaded
successfully and hence are not visible to the system.
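
The first two steps of the design above (snapshot before the operation, then clean up and restore on failure) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: plain dictionaries stand in for the backup meta-table and the backup destination, and {{run_with_rollback}} is a hypothetical helper, not a function from the patch.

```python
import copy
from typing import Callable, Dict

def run_with_rollback(
    meta_table: Dict[str, str],
    destination: Dict[str, bytes],
    operation: Callable[[Dict[str, str], Dict[str, bytes]], None],
) -> None:
    """Run a backup-like operation, rolling back both stores if it fails."""
    # Step 1: snapshot the (small) meta table before the operation starts.
    snapshot = copy.deepcopy(meta_table)
    keys_before = set(destination)
    try:
        operation(meta_table, destination)
    except Exception:
        # Step 2a: clean up partial data written to the backup destination.
        for key in set(destination) - keys_before:
            del destination[key]
        # Step 2b: restore the meta table from the snapshot.
        meta_table.clear()
        meta_table.update(snapshot)
        raise  # the caller still sees the original failure
```

The model covers only the server-side failure path; a client that dies mid-operation never reaches the {{except}} branch, which is the case the repair tool exists for.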



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
