hbase-issues mailing list archives

From "Dave Latham (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13639) SyncTable - rsync for HBase tables
Date Tue, 29 Mar 2016 15:20:25 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216169#comment-15216169 ]

Dave Latham commented on HBASE-13639:

Sorry for the lack of better documentation, [~abhishek_soni].  Thanks for bringing it up.
I'll try to provide a better explanation.  You may have already seen it, but if not, the
design doc linked in the issue description may also give you some better clues as to how it
should be used.

Briefly, the feature is intended to start with a pair of tables in remote clusters that are
already substantially similar and make them identical by comparing hashes of the data and
copying only the differences, instead of having to copy the entire table.  So it is targeted
at a very specific use case (with some work it could generalize to cover things like CopyTable
and VerifyReplication, but it's not there yet).  To use it, you choose one table to be the
"source" and the other to be the "target".  After the process is complete, the target table
should end up being identical to the source table.

- In the source table's cluster, run org.apache.hadoop.hbase.mapreduce.HashTable and pass
it the name of the source table and an output directory in HDFS.  HashTable will scan the
source table, break the data up into row key ranges (default of 8kB per range) and produce
a hash of the data for each range.
- Make the hashes available to the target cluster.  I'd recommend using DistCp to copy them.
- In the target table's cluster, run org.apache.hadoop.hbase.mapreduce.SyncTable and pass
it the directory where you put the hashes, and the names of the source and target tables.
You will likely also need to specify the source table's ZK quorum via the --sourcezkcluster
option.  SyncTable will then read the hash information, and compute the hashes of the same
row ranges for the target table.  For any row range where the hash fails to match, it will
open a remote scanner to the source table, read the data for that range, and do Puts and Deletes
to the target table to update it to match the source.
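
The steps above can be modelled in miniature.  What follows is a toy Python sketch, not the
HBase implementation: tables are plain dicts, ranges are cut by row count rather than by bytes,
MD5 is simply this sketch's choice of hash, and the names hash_rows, hash_table and sync_table
are hypothetical.  It only illustrates the idea of hashing row key ranges on the source side
and touching the target only where a range's hash differs.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

# Toy stand-in for an HBase table: row key -> {column: value}.
Table = Dict[str, Dict[str, str]]
RangeHash = Tuple[str, Optional[str], str]  # (start key, stop key or None, hash)


def in_range(key: str, start: str, stop: Optional[str]) -> bool:
    """True when start <= key < stop; stop=None means no upper bound."""
    return key >= start and (stop is None or key < stop)


def hash_rows(table: Table, start: str, stop: Optional[str]) -> str:
    """Hash every cell of every row whose key falls in [start, stop)."""
    digest = hashlib.md5()
    for key in sorted(table):
        if in_range(key, start, stop):
            for column in sorted(table[key]):
                digest.update(repr((key, column, table[key][column])).encode())
    return digest.hexdigest()


def hash_table(source: Table, rows_per_range: int) -> List[RangeHash]:
    """Source side: cut the key space into consecutive ranges and hash each."""
    keys = sorted(source)
    # The first range starts at the empty key and the last has no stop key,
    # so rows that exist only in the target still fall inside some range.
    starts = [""] + keys[rows_per_range::rows_per_range]
    stops: List[Optional[str]] = [*starts[1:], None]
    return [(a, b, hash_rows(source, a, b)) for a, b in zip(starts, stops)]


def sync_table(hashes: List[RangeHash], source: Table, target: Table):
    """Target side: re-hash each range locally and repair only the mismatches."""
    mismatched = []
    for start, stop, source_hash in hashes:
        if hash_rows(target, start, stop) == source_hash:
            continue  # range already identical, nothing is read from the source
        mismatched.append((start, stop))
        # Stand-in for the remote scan of the source table over this range.
        for key in [k for k in target if in_range(k, start, stop)]:
            if key not in source:
                del target[key]  # "Delete": row is absent from the source
        for key in [k for k in source if in_range(k, start, stop)]:
            target[key] = dict(source[key])  # "Put": overwrite with source row
    return mismatched


source = {f"r{i}": {"cf:a": str(i)} for i in range(1, 6)}
target = {k: dict(v) for k, v in source.items()}
target["r2"]["cf:a"] = "stale"      # a cell that differs from the source
target["r4b"] = {"cf:a": "extra"}   # a row the source does not have

hashes = hash_table(source, rows_per_range=2)
mismatched = sync_table(hashes, source, target)
print(mismatched)        # ranges that had to be repaired
print(target == source)  # whether the target now matches the source
```

In this example only two of the three ranges differ, so the third is never read from the
source.  That is where the savings come from when the tables are already mostly the same.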

I hope that clarifies it a bit.  Let me know if you need a hand.  If anyone wants to work
on getting some documentation into the book, I can try to write some more but would love a
hand on turning it into an actual book patch.

> SyncTable - rsync for HBase tables
> ----------------------------------
>                 Key: HBASE-13639
>                 URL: https://issues.apache.org/jira/browse/HBASE-13639
>             Project: HBase
>          Issue Type: New Feature
>          Components: mapreduce, Operability, tooling
>            Reporter: Dave Latham
>            Assignee: Dave Latham
>              Labels: tooling
>             Fix For: 2.0.0, 0.98.14, 1.2.0
>         Attachments: HBASE-13639-0.98-addendum-hadoop-1.patch, HBASE-13639-0.98.patch,
HBASE-13639-v1.patch, HBASE-13639-v2.patch, HBASE-13639-v3-0.98.patch, HBASE-13639-v3.patch,
> Given HBase tables in remote clusters with similar but not identical data, efficiently
update a target table such that the data in question is identical to a source table.  Efficiency
in this context means using far less network traffic than would be required to ship all the
data from one cluster to the other.  Takes inspiration from rsync.
> Design doc: https://docs.google.com/document/d/1-2c9kJEWNrXf5V4q_wBcoIXfdchN7Pxvxv1IO6PW0-U/

This message was sent by Atlassian JIRA
