hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From lars hofhansl <la...@apache.org>
Subject Re: How to verify data from MySQL and HBase
Date Wed, 13 Aug 2014 05:14:28 GMT
Just in the interest of stating the obvious: Don't write a tool that scans through all the
data in MySQL (or HBase) and then looks up each individual row (or even batches of rows) in
the other store. That is very inefficient if you have a lot of data.

Do it like a merge-join instead: Get sorted results from MySQL such that they sort in the
same order as the HBase key. Then read through those results and the HBase scan at the same
time, advancing both sides together. That way you need only a single scan on both sides (per
table).

-- Lars



________________________________
 From: tobe <tobeg3oogle@gmail.com>
To: "user@hbase.apache.org" <user@hbase.apache.org> 
Sent: Tuesday, August 12, 2014 7:01 AM
Subject: Re: How to verify data from MySQL and HBase
 

Thank @JM for the detailed explanation.

I totally agree with you and we're developing an internal tool to do it.
It's not so general because we have to write the sql and generate the row
key manually. But it works and easy to understand. I would like to share
with anybody who also needs it.






On Tue, Aug 12, 2014 at 8:45 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Tobe,
>
> Thing is, your data in HBase might be organize very differently than in
> MySQL. Have you denormlized some of it? Have you used some Avro containers
> into an HBase cell? Have you do any cleanup? Or enrichment? At the end, it
> my be very different that what is stored into MySQL. There is not any easy
> process to validate that migration as been done correctly between the two
> databases because tools don't know the transformations you applied. I think
> you will have to build someone internally which will do this validation
> because not such tool exist today.
>
> You can use the row counters to count the rows, but will that mean content
> is correct? No. Calculate a CRC on the cells value? Not even, because you
> might have denormalized.
>
> Sorry, but you will have to do some coding here I think.
>
> JM
>
> JM
>
>
> 2014-08-12 3:49 GMT-04:00 tobe <tobeg3oogle@gmail.com>:
>
> > Thanks for replaying. @Serega
> >
> > I know MySQL and HBase are reliable. What I want to validate is the date
> in
> > both MySQL and HBase. The upper-stream application and unexpected
> operation
> > may make it inconsistent.
> >
> > I'm also wondering could sqoop validate the values from each database?
> > There's RowCountValidator but it's not suitable for us.
> >
> >
> > On Tue, Aug 12, 2014 at 3:23 PM, Serega Sheypak <
> serega.sheypak@gmail.com>
> > wrote:
> >
> > > you should design resilient ETL-processes. Also introduce post-ETL
> > checks.
> > > There is no need to test MySQL or HBase. They are already tested.
> > > See this:
> > >
> >
> http://www.slideshare.net/wyaddow/data-verification-in-qa-department-final
> > > Pretty old, but gives basic ideas. Nothing changed from that time.
> > >
> > >
> > > 2014-08-12 7:55 GMT+04:00 tobe <tobeg3oogle@gmail.com>:
> > >
> > > > Most of our users migrated their data form MySQL to HBase. Before
> they
> > > > totally trust HBase, they use MySQL and HBase at the same time.
> > Sometimes
> > > > the data is inconsistent because they use it incorrectly or maybe
> > > there're
> > > > bugs of HBase. Anyway, we have to make sure the data from MySQL and
> > HBase
> > > > is consistent.
> > > >
> > > > So how can we do that? Write a simple script or is there any general
> > > > method?
> > > >
> > >
> >
>
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message