hbase-issues mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13042) MR Job to export HFiles directly from an online cluster
Date Fri, 13 Feb 2015 19:30:12 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320618#comment-14320618 ]

Enis Soztutar commented on HBASE-13042:

Here is an idea. Not sure whether it will help you.

{{TableSnapshotInputFormat}} allows you to run any MR job directly over a snapshot. It also
accepts key ranges, and eliminates regions outside the range.
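For reference, wiring a custom job to a snapshot with a key range looks roughly like the job-configuration sketch below. The snapshot name, range bytes, restore directory, and the {{ExportMapper}} stub are all placeholders, not anything defined by this issue:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class SnapshotRangeExport {

  // Placeholder mapper; a real export job would emit cells for HFileOutputFormat.
  static class ExportMapper
      extends TableMapper<ImmutableBytesWritable, Result> {}

  public static Job configure(String snapshotName, byte[] start, byte[] stop)
      throws IOException {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "export-" + snapshotName);

    // Restrict the scan to one key range; TableSnapshotInputFormat will not
    // create splits for regions entirely outside [start, stop).
    Scan scan = new Scan();
    scan.setStartRow(start);
    scan.setStopRow(stop);

    TableMapReduceUtil.initTableSnapshotMapperJob(
        snapshotName, scan, ExportMapper.class,
        ImmutableBytesWritable.class, Result.class, job,
        true, new Path("/tmp/restore-" + snapshotName)); // temp restore dir
    return job;
  }
}
```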

Without HBASE-13031, a snapshot is still a full-table snapshot, but what you can do is:

Decide on table ranges (let's say N ranges), then for i in 0..N-1:
   (1) take a snapshot
   (2) run a custom MR job over the snapshot to export the data (create HFiles for bulk load) for Range[i]
   (3) delete the snapshot

You only hold the single snapshot during step (2), and you can control how long that step
takes via the size of Range[i].
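The loop above can be sketched as plain control flow. {{takeSnapshot}}, {{exportRange}}, and {{deleteSnapshot}} are illustrative stand-ins for {{Admin.snapshot()}}, the custom MR export job, and {{Admin.deleteSnapshot()}} — here they only track which snapshots are live so the invariant (at most one snapshot held at a time) is visible:

```java
import java.util.ArrayList;
import java.util.List;

public class RangeExportLoop {
  // Stand-ins for Admin.snapshot(), the custom MR export job over
  // TableSnapshotInputFormat, and Admin.deleteSnapshot(); they only
  // record which snapshots are currently live.
  static final List<String> liveSnapshots = new ArrayList<>();

  static void takeSnapshot(String name)           { liveSnapshots.add(name); }
  static void exportRange(String snapshot, int i) { /* MR job reads only Range[i] */ }
  static void deleteSnapshot(String name)         { liveSnapshots.remove(name); }

  public static void exportTable(int numRanges) {
    for (int i = 0; i < numRanges; i++) {
      String name = "export_snap_" + i;
      takeSnapshot(name);       // (1) the snapshot covers the whole table...
      exportRange(name, i);     // (2) ...but the job only exports Range[i]
      deleteSnapshot(name);     // (3) so at most one snapshot is held at a time
    }
  }

  public static void main(String[] args) {
    exportTable(4);
    System.out.println("live snapshots after export: " + liveSnapshots.size());
    // prints "live snapshots after export: 0"
  }
}
```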

> MR Job to export HFiles directly from an online cluster
> -------------------------------------------------------
>                 Key: HBASE-13042
>                 URL: https://issues.apache.org/jira/browse/HBASE-13042
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Dave Latham
> We're looking at the best way to bootstrap a new remote cluster.  The source cluster
has a large table of compressed data using more than 50% of the HDFS capacity, and we have
a WAN link to the remote cluster.  Ideally we would set up replication to a new table remotely,
snapshot the source table, copy the snapshot across, then bulk load it into the new table.
 However the amount of time to copy the data remotely is greater than the major compaction
interval so the source cluster would run out of storage.
> One approach is HBASE-13031 to allow the operators to snapshot and copy one key range
at a time.  Here's another idea:
> Create a MR job that tries to do a robust remote HFile copy directly:
>  * Each split is responsible for a key range.
>  * Map task looks up that key range and maps it to a set of HDFS store directories
(one for each region/family)
>  * For each store:
>    ** List HFiles in store (needs to be less than 1000 files to guarantee atomic listing)
>    ** Attempt to copy store files (copy in increasing size order to minimize likelihood
of compaction removing a file during copy)
>    ** If some of the files disappear (compaction), retry directory list / copy
>  * If any of the stores disappear (region split / merge) then retry map task (and remap
key range to stores)
> Or maybe there are some HBase locking mechanisms for a region or store that would work
better.  Otherwise the question is how often compactions or region splits would force retries.
> Is this crazy? 
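The list-then-copy-then-retry scheme from the quoted description can be sketched against the local filesystem (a stand-in for HDFS; the method and parameter names are illustrative). Files are copied smallest first, and a file vanishing mid-copy (the compaction analogue) triggers a fresh list-and-copy pass:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StoreCopier {
  /**
   * Copy every file in srcStore to destStore, smallest first, re-listing and
   * retrying the whole pass if a file disappears mid-copy (as a compaction
   * removing an HFile would cause). Local filesystem stands in for HDFS.
   */
  public static void copyStore(Path srcStore, Path destStore, int maxRetries)
      throws IOException {
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try (Stream<Path> listing = Files.list(srcStore)) {
        // Increasing size order, per the proposal, to minimize the window in
        // which a compaction can remove a not-yet-copied file.
        List<Path> files = listing.filter(Files::isRegularFile)
            .sorted(Comparator.comparingLong(p -> {
              try { return Files.size(p); }
              catch (IOException e) { return Long.MAX_VALUE; }
            }))
            .collect(Collectors.toList());
        try {
          for (Path f : files) {
            Files.copy(f, destStore.resolve(f.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
          }
          return; // one consistent pass succeeded
        } catch (NoSuchFileException gone) {
          // A file vanished under us (compaction analogue): retry the pass.
        }
      }
    }
    throw new IOException("store kept changing after " + maxRetries + " retries");
  }
}
```

In the real job, the analogous "file vanished" signal would come from the HDFS client, and a vanished store directory (split/merge) would instead fail the map task so the key range is remapped to stores.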

This message was sent by Atlassian JIRA
