hbase-issues mailing list archives

From "Jean-Marc Spaggiari (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-11715) HBase should provide a tool to compare 2 remote tables.
Date Fri, 15 Aug 2014 18:57:20 GMT

    [ https://issues.apache.org/jira/browse/HBASE-11715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098940#comment-14098940 ]

Jean-Marc Spaggiari commented on HBASE-11715:

1. How is this table copied? Do we flush and just move the HFiles over?
Copying the table is not in scope for this. This is just a tool to do the comparison of tables.

2. What do we do if they are not equivalent? Is it enough to throw an error, or do we need
to say what part of the table isn't equivalent?
We report the information back to the user. For example: for range A to C, content differs
between the 2 tables.

3. Do Merkle trees make sense for this type of thing?
Not sure. We don't have any tree structure here.

I am interested in working on this task. With a Merkle tree, we would need to constantly run
some background service, and it would require an additional amount of data.
I don't think a Merkle tree is the right option here. But you can still evaluate it.
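For anyone evaluating the Merkle-tree option mentioned above, the core idea can be sketched in a few lines: build a tree over per-range hashes so that two tables can first be compared by a single root, and a mismatch can be narrowed down by descending the tree. This is a hypothetical illustration, not part of the proposed tool; the leaf hashes and tree shape are assumptions.

```python
import hashlib

def merkle_root(leaf_hashes):
    """Combine per-range hashes pairwise until a single root remains.

    Two tables match iff their roots match; on a mismatch, comparing
    the intermediate levels narrows the search to the differing range(s).
    """
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:
            # Duplicate the last node so every level pairs up evenly.
            level.append(level[-1])
        level = [
            hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

The trade-off discussed above is visible here: the tree gives log-depth narrowing of differences, but keeping the leaf hashes current would require continuously re-hashing ranges as the table mutates.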

Can you provide more details? I can assign it to myself and work on this.
Sure! Let's go for it.

> HBase should provide a tool to compare 2 remote tables.
> -------------------------------------------------------
>                 Key: HBASE-11715
>                 URL: https://issues.apache.org/jira/browse/HBASE-11715
>             Project: HBase
>          Issue Type: New Feature
>          Components: util
>            Reporter: Jean-Marc Spaggiari
> As discussed in the mailing list, when a table is copied to another cluster and needs
to be validated against the first one, only VerifyReplication can be used. However, this can
be very slow since the data needs to be copied again.
> We should provide an easier and faster way to compare the tables. 
> One option is to calculate hashes per range. The user can define a number of buckets; then
we split the table into this number of buckets and calculate a hash for each (like the partitioner
is already doing). We can also optionally calculate an overall CRC to reduce even more hash
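The bucket-hash scheme described in the issue can be sketched roughly as follows. This is a hypothetical standalone illustration, not the proposed HBase tool: the in-memory row lists, the bucket assignment, and the SHA-256 hash are all assumptions made for the sketch.

```python
import hashlib

def bucket_hashes(rows, num_buckets):
    """Split sorted (key, value) rows into num_buckets contiguous key
    ranges and compute one digest per range."""
    buckets = [hashlib.sha256() for _ in range(num_buckets)]
    n = len(rows)
    for i, (key, value) in enumerate(rows):
        # Assign row i to a bucket by its position in the sorted order,
        # mimicking a partitioner splitting the keyspace evenly.
        b = min(i * num_buckets // n, num_buckets - 1)
        buckets[b].update(key.encode())
        buckets[b].update(value.encode())
    return [h.hexdigest() for h in buckets]

def compare_tables(rows_a, rows_b, num_buckets=4):
    """Return the indices of key ranges whose hashes differ."""
    ha = bucket_hashes(rows_a, num_buckets)
    hb = bucket_hashes(rows_b, num_buckets)
    return [i for i in range(num_buckets) if ha[i] != hb[i]]
```

The point of the scheme is that only the per-bucket digests need to cross the network, so a mismatched range can be re-scanned locally instead of copying all the data again, which is what makes VerifyReplication slow.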

This message was sent by Atlassian JIRA
