hbase-issues mailing list archives

From "Matteo Bertozzi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13430) HFiles that are in use by a table cloned from a snapshot may be deleted when that snapshot is deleted
Date Mon, 13 Apr 2015 21:08:14 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493091#comment-14493091 ]

Matteo Bertozzi commented on HBASE-13430:
-----------------------------------------

I think the solution Tobi provided is OK; we already do the same thing with HFileLink.
The main problem is files that move around. If we want to change the fs layout again,
we should avoid having files move around (see HBASE-7806).

So, to me, the only fix we should make here is adding the temp dir (as HFileLink already
does). Otherwise we end up with much more code and possible "incompatibilities" during a
rolling upgrade of the masters.

> HFiles that are in use by a table cloned from a snapshot may be deleted when that snapshot is deleted
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-13430
>                 URL: https://issues.apache.org/jira/browse/HBASE-13430
>             Project: HBase
>          Issue Type: Bug
>          Components: hbase
>            Reporter: Tobi Vollebregt
>            Priority: Critical
>              Labels: data-integrity, master
>             Fix For: 2.0.0, 1.1.0, 0.98.13, 1.0.2
>
>         Attachments: HBASE-13430-master-v1.patch, HBASE-13430-master-v2.patch, hbase-13430-attempted-fix.patch, hbase-13430-test.patch
>
>
> We recently had a production issue in which HFiles that were still in use by a table
> were deleted. This appears to have been caused by race conditions in the order in which
> HFileLinks are created, combined with the fact that only files younger than
> {{hbase.master.hfilecleaner.ttl}} are kept alive.
> This is how to reproduce:
>  * Clone a large snapshot into a new table. The clone operation must take longer than
>    {{hbase.master.hfilecleaner.ttl}} for data loss to be guaranteed.
>  * Ensure that no other table or snapshot is referencing the HFiles used by the new table.
>  * Delete the snapshot. This breaks the table.
> The main cause is this:
>  * Cloning a snapshot creates the table in the {{HBASE_TEMP_DIRECTORY}}.
>  * However, it immediately creates back references to the HFileLinks that it creates
>    for the table in the archive directory.
>  * HFileLinkCleaner does not check the {{HBASE_TEMP_DIRECTORY}}, so it considers all
>    those back references deletable.
>  * The only thing that keeps them alive is the TimeToLiveHFileCleaner, but only for 5
>    minutes.
>  * So if cloning the snapshot takes more than 5 minutes, and the HFiles aren't referenced
>    by anything else, data loss is guaranteed.
> I have a unit test that reproduces the issue. I tried to fix it, but did not completely
> succeed. I will attach the patch shortly.
> Workarounds:
>  * Don't delete any snapshots that you cloned into a table. (We used this successfully:
>    after the data loss happened, we restored the deleted snapshot from backup using
>    ExportSnapshot, which reversed the data loss.)
>  * Manually check the back references and create any missing ones after cloning a snapshot.
>  * Increase {{hbase.master.hfilecleaner.ttl}}. (untested)
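
The interaction between the two cleaners described above can be illustrated with a small
toy model. This is only a sketch, not HBase code: the class CleanerSketch, its methods, and
the link name "clone/cf/link1" are hypothetical, and the rules are simplified to what the
report states (a TTL of 5 minutes, and a link check that either ignores or includes the temp
directory).

```java
import java.util.Set;

/** Toy model of the cleaner chain described in the report; not HBase code. */
class CleanerSketch {
    /** 5 minutes in ms, the TTL quoted in the report for hbase.master.hfilecleaner.ttl. */
    static final long TTL_MS = 5 * 60 * 1000L;

    /** TTL rule: a file becomes deletable only once it is older than the TTL. */
    static boolean ttlSaysDeletable(long ageMs, long ttlMs) {
        return ageMs > ttlMs;
    }

    /**
     * Link rule: a back reference becomes deletable when no link for it is found.
     * checkTempDir=false models the reported behaviour; true models the proposed fix.
     */
    static boolean linkSaysDeletable(String link, Set<String> dataDir,
                                     Set<String> tempDir, boolean checkTempDir) {
        boolean found = dataDir.contains(link)
            || (checkTempDir && tempDir.contains(link));
        return !found;
    }

    /** A back reference is removed only if every rule in the chain agrees. */
    static boolean isDeletable(String link, long ageMs, Set<String> dataDir,
                               Set<String> tempDir, boolean checkTempDir) {
        return ttlSaysDeletable(ageMs, TTL_MS)
            && linkSaysDeletable(link, dataDir, tempDir, checkTempDir);
    }

    public static void main(String[] args) {
        Set<String> dataDir = Set.of();                 // clone not yet moved into place
        Set<String> tempDir = Set.of("clone/cf/link1"); // hypothetical link name
        long tenMinutes = 10 * 60 * 1000L;              // clone has been running 10 minutes

        // Temp dir ignored (reported behaviour) vs. temp dir checked (proposed fix).
        System.out.println(isDeletable("clone/cf/link1", tenMinutes, dataDir, tempDir, false));
        System.out.println(isDeletable("clone/cf/link1", tenMinutes, dataDir, tempDir, true));
    }
}
```

In this model, a back reference older than the TTL is reported as deletable while the clone
still sits in the temp directory, unless the link check also looks there. Within the first
5 minutes the TTL rule alone keeps it alive, which matches the timing described in the report.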



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
