hbase-issues mailing list archives

From "Tobi Vollebregt (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13430) HFiles that are in use by a table cloned from a snapshot may be deleted when that snapshot is deleted
Date Fri, 10 Apr 2015 22:52:12 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490498#comment-14490498 ]

Tobi Vollebregt commented on HBASE-13430:

Okay, so based on running the tests many times in the background, I'm fairly sure that my changes
to {{HFileLink}} cause this failure:
Expected :17576
Actual   :14046

If I undo that change but keep my changes to {{HFileCleaner}}, I don't see that failure in
50 test runs; if I keep the changes to {{HFileLink}}, the *existing tests* fail roughly 1 out of
10 times with the above {{AssertionError}}.

I will run some more tests to make sure that the changes to {{HFileCleaner}} are sufficient
to fix the issue, and then I'll submit a smaller patch that does not modify {{HFileLink}}.

> HFiles that are in use by a table cloned from a snapshot may be deleted when that snapshot
is deleted
> -----------------------------------------------------------------------------------------------------
>                 Key: HBASE-13430
>                 URL: https://issues.apache.org/jira/browse/HBASE-13430
>             Project: HBase
>          Issue Type: Bug
>          Components: hbase
>            Reporter: Tobi Vollebregt
>            Priority: Critical
>              Labels: data-integrity, master
>             Fix For: 2.0.0, 1.1.0, 0.98.13, 1.0.2
>         Attachments: HBASE-13430-master-v1.patch, hbase-13430-attempted-fix.patch, hbase-13430-test.patch
> We recently had a production issue in which HFiles that were still in use by a table
were deleted. This appears to have been caused by race conditions in the order in which HFileLinks
are created, combined with the fact that only files younger than {{hbase.master.hfilecleaner.ttl}}
are kept alive.
> This is how to reproduce:
>  * Clone a large snapshot into a new table. The clone operation must take more than {{hbase.master.hfilecleaner.ttl}}
time to guarantee data loss.
>  * Ensure that no other table or snapshot is referencing the HFiles used by the new table.
>  * Delete the snapshot. This breaks the table.
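The reproduction steps above correspond to a short HBase shell session; the snapshot and table names below are placeholders, not taken from the issue:

```shell
# Clone a snapshot large enough that materializing the clone takes longer
# than hbase.master.hfilecleaner.ttl (5 minutes by default).
hbase> clone_snapshot 'big_snapshot', 'cloned_table'

# While the clone is still in progress, delete the snapshot. If nothing
# else references the HFiles, the cleaner chain removes them and the
# cloned table is left with dangling HFileLinks.
hbase> delete_snapshot 'big_snapshot'
```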
> The main cause is this:
>  * Cloning a snapshot creates the table in the {{HBASE_TEMP_DIRECTORY}}.
>  * However, it immediately creates back references to the HFileLinks that it creates
for the table in the archive directory.
>  * HFileLinkCleaner does not check the {{HBASE_TEMP_DIRECTORY}}, so it considers all
those back references deletable.
>  * The only thing that keeps them alive is the TimeToLiveHFileCleaner, but only for 5 minutes
(the default {{hbase.master.hfilecleaner.ttl}}).
>  * So if cloning the snapshot takes more than 5 minutes, and the HFiles aren't referenced
by anything else, data loss is guaranteed.
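To make the failure mode above concrete, here is a minimal sketch (not HBase's actual class, and the names are illustrative) of the age-versus-TTL check that TimeToLiveHFileCleaner performs on files in the archive directory:

```java
// Sketch of the TTL check: a file becomes deletable once its age exceeds
// the configured ttl. The back references created during clone_snapshot
// are subject to exactly this check, because HFileLinkCleaner does not
// look in HBASE_TEMP_DIRECTORY and therefore does not veto the delete.
public class TtlCheckSketch {
    // Default hbase.master.hfilecleaner.ttl: 5 minutes, in milliseconds.
    static final long TTL_MS = 5 * 60 * 1000;

    // True when the file's age (now - modification time) exceeds the TTL.
    static boolean isDeletable(long fileModTimeMs, long nowMs) {
        return nowMs - fileModTimeMs > TTL_MS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // A back reference created 4 minutes ago survives...
        System.out.println(isDeletable(now - 4 * 60 * 1000, now)); // false
        // ...but one created 6 minutes ago is deleted, even if the clone
        // that needs it is still being built in HBASE_TEMP_DIRECTORY.
        System.out.println(isDeletable(now - 6 * 60 * 1000, now)); // true
    }
}
```

This is why a clone that takes longer than the TTL loses data: the back references age out before the cloned table leaves the temp directory.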
> I have a unit test reproducing the issue and I tried to fix this, but didn't completely
succeed. I will attach the patch shortly.
> Workarounds:
>  * Don't delete any snapshots that you cloned into a table (we used this successfully: after
the data loss happened, we restored the deleted snapshot from backup using ExportSnapshot,
which reversed the data loss).
>  * Manually check the back references and create any missing ones after cloning a snapshot.
>  * Increase {{hbase.master.hfilecleaner.ttl}}. (untested)
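The last workaround would be applied in {{hbase-site.xml}}; the value below is illustrative, not a recommendation from the issue:

```xml
<!-- Raise the cleaner TTL well above the longest expected
     clone_snapshot duration. 86400000 ms = 24 hours. -->
<property>
  <name>hbase.master.hfilecleaner.ttl</name>
  <value>86400000</value>
</property>
```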
