hbase-issues mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13031) Ability to snapshot based on a key range
Date Fri, 13 Feb 2015 19:29:12 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320616#comment-14320616 ]

Andrew Purtell commented on HBASE-13031:

bq. can you think of how would we do this in a streaming fashion?

Sure. I've been aiming for minimum dev time using existing tools. The suggestion above needs
a simple MR job that takes sequence files as input and produces HFiles; not difficult. It could
also require adding lzma codec support to your Hadoop if you're crazy enough to try it (smile),
which I've done before; the coding isn't so bad, but the compressor might be too slow...
Anyway, if you can invest some dev time, there's no reason the export/compress job workers
need to write the stream of compressed KVs to the local DFS for a copy. They could contact
workers running at the remote site for streaming transfer, and those remote workers could
write directly to HFiles for bulk load. You'd have to think about how to handle broken
connections. This could be a fair amount of work, but it's still better for a couple of reasons:
- We can compress data for WAN transfer better than we ever could/should in HFiles.
- Minimizing data copies at petascale saves a lot of time.
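The first point can be illustrated outside of HBase: HFile blocks are compressed independently so that each block stays seekable, which means the compressor's dictionary resets at every block boundary, while a transfer pipeline can compress one continuous stream of KVs. A minimal sketch of the size difference, using java.util.zip's gzip as a stand-in for lzma and synthetic row data (all names here are hypothetical, not HBase APIs):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class StreamVsBlockCompression {

    // Synthetic, repetitive rows standing in for a KV export stream.
    static byte[] sampleRows(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(String.format("row%05d/cf:qual/value-payload-%05d%n", i, i));
        }
        return sb.toString().getBytes();
    }

    // Compressed size of one slice of the data.
    static int gzipSize(byte[] data, int off, int len) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data, off, len);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        byte[] rows = sampleRows(2000);

        // One continuous stream: the compressor keeps its dictionary warm
        // across the whole export, as a WAN-transfer pipeline could.
        int whole = gzipSize(rows, 0, rows.length);

        // Independent small blocks: roughly how an HFile compresses, where
        // each block must remain independently decodable.
        int blockSize = 4096, perBlockTotal = 0;
        for (int off = 0; off < rows.length; off += blockSize) {
            perBlockTotal += gzipSize(rows, off, Math.min(blockSize, rows.length - off));
        }

        System.out.println("whole-stream: " + whole
            + " bytes, per-block total: " + perBlockTotal + " bytes");
    }
}
```

On repetitive data like this the whole-stream size comes out well below the per-block total; the gap is exactly the headroom a dedicated transfer codec can exploit.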

> Ability to snapshot based on a key range
> ----------------------------------------
>                 Key: HBASE-13031
>                 URL: https://issues.apache.org/jira/browse/HBASE-13031
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: churro morales
>            Assignee: churro morales
>             Fix For: 2.0.0, 0.94.26, 1.1.0, 0.98.11
> Posted on the mailing list, and it seems like some people are interested. A little
> background for everyone.
> We have a very large table that we would like to snapshot and transfer to another cluster
> (compressed data is always better to ship). Our problem lies in the fact that it could take
> many weeks to transfer all of the data, and during that time, with major compactions, the
> data stored in DFS has the potential to double, which would cause us to run out of disk space.
> So we were thinking about adding the ability to snapshot a specific key range.
> Ideally, the user would specify a start and stop key, and those would be associated with
> region boundaries. If the boundaries change between the time the user submits the request
> and the time the snapshot is taken (due to merging or splitting of regions), the snapshot
> should fail.
> We would know which regions to snapshot, and if those changed between when the request
> was submitted and the regions locked, the snapshot could simply fail and the user would try
> again, instead of potentially getting more / less than what they had anticipated.
> I was planning on storing the start / stop key in the SnapshotDescription, and from there
> it looks pretty straightforward: we just have to change the verifier code to accommodate
> key ranges.
> If this design sounds good, or if I am overlooking anything, please let me know.
> Once we agree on the design, I'll write and submit the patches.
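The check described above (map [start, stop) to a set of regions at request time, then fail the snapshot if that covering set has changed by the time the regions are locked) can be sketched without HBase. This is a hypothetical simplification: keys are plain Strings and regions are (startKey, endKey) pairs, where real HBase code would use RegionInfo and byte[] keys compared with Bytes.compareTo:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeyRangeSnapshotCheck {

    // Regions modeled as [startKey, endKey) pairs; "" means unbounded.
    static List<String> coveringRegions(List<String[]> regions, String start, String stop) {
        List<String> covered = new ArrayList<>();
        for (String[] r : regions) {
            boolean endsBeforeRange = !r[1].isEmpty() && r[1].compareTo(start) <= 0;
            boolean startsAfterRange = r[0].compareTo(stop) >= 0;
            if (!endsBeforeRange && !startsAfterRange) {
                covered.add(r[0] + ".." + r[1]);
            }
        }
        return covered;
    }

    // The snapshot proceeds only if the covering regions captured at request
    // time are identical at snapshot time; otherwise the user simply retries.
    static boolean verifyUnchanged(List<String> atRequest, List<String> atSnapshot) {
        return atRequest.equals(atSnapshot);
    }

    public static void main(String[] args) {
        List<String[]> atRequest = Arrays.asList(
            new String[]{"", "d"}, new String[]{"d", "m"},
            new String[]{"m", "t"}, new String[]{"t", ""});
        List<String> planned = coveringRegions(atRequest, "e", "p");

        // Region m..t split at "p" before the snapshot could be taken.
        List<String[]> atSnapshot = Arrays.asList(
            new String[]{"", "d"}, new String[]{"d", "m"},
            new String[]{"m", "p"}, new String[]{"p", "t"}, new String[]{"t", ""});
        List<String> actual = coveringRegions(atSnapshot, "e", "p");

        System.out.println("planned: " + planned);
        System.out.println("actual:  " + actual);
        System.out.println("safe to snapshot: " + verifyUnchanged(planned, actual));
    }
}
```

Comparing (start, end) pairs rather than just region names matters here: a split whose new boundary equals the requested stop key leaves the covered start keys unchanged but still alters a region's extent, and the snapshot should fail in that case too.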

This message was sent by Atlassian JIRA
