hbase-user mailing list archives

From Alexandre Normand <alexandre.norm...@gmail.com>
Subject Re: Seeking advice on skipped/lost data during data migration from and to a hbase table
Date Sun, 05 Feb 2017 20:54:44 GMT
Thanks, Ted. We're running HBase 1.0.0-cdh5.5.4, which isn't among the fixed
versions, so this might be related. It's somewhat reassuring to think that
this would be missed data on the scan/source side, because that would mean
our other ingest/write workloads wouldn't be affected.

From reading the JIRA description, it sounds like it would be difficult to
confirm that we've been affected by this bug. Am I right?
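
For what it's worth, one way to narrow it down might be something along these
lines (a minimal sketch; the table names and the migration-start timestamp are
placeholders): for each row key the comparison reported missing from the
destination, confirm it still exists in the source with cell timestamps older
than the start of the migration job. If it does, the row should have been
copied, which points at the scan or write path rather than at data that
arrived later.

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MissingRowCheck {
      public static void main(String[] args) throws IOException {
        // args[0]: migration job start time (epoch millis); the rest: missing row keys.
        long migrationStartTs = Long.parseLong(args[0]);
        List<String> missingKeys = java.util.Arrays.asList(args).subList(1, args.length);

        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table source = conn.getTable(TableName.valueOf("source_table")); // placeholder
             Table dest = conn.getTable(TableName.valueOf("dest_table"))) {   // placeholder
          for (String key : missingKeys) {
            // Only look at cells written before the migration job started.
            Get oldCellsOnly = new Get(Bytes.toBytes(key));
            oldCellsOnly.setTimeRange(0L, migrationStartTs);
            Result inSource = source.get(oldCellsOnly);
            Result inDest = dest.get(new Get(Bytes.toBytes(key)));
            System.out.printf("%s sourceHadRowBeforeMigration=%b destHasRow=%b%n",
                key, !inSource.isEmpty(), !inDest.isEmpty());
          }
        }
      }
    }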

On Sun, Feb 5, 2017 at 12:36 PM Ted Yu <yuzhihong@gmail.com> wrote:

> Which release of HBase are you using?
>
> To be specific, does the release have HBASE-15378?
>
> Cheers
>
> On Sun, Feb 5, 2017 at 11:32 AM, Alexandre Normand <alexandre.normand@gmail.com> wrote:
>
> > We're migrating data from a previous iteration of a table to a new one,
> > and this process involved an MR job that scans data from the source table
> > and writes the equivalent data to the new table. The source table has
> > 6000+ regions and it splits frequently because we're still ingesting time
> > series data into it. We used buffered writing on the destination side when
> > writing to the new table, and we have a YARN resource pool to limit the
> > concurrent writing.
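
For context, the job is roughly along these lines (a simplified sketch, not the
actual code; class and table names are placeholders). It's a map-only job: each
map task scans one source region and the mapper writes to the destination
itself, so there's no reduce phase and no MR output.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class CopyTableDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "migrate source_table -> dest_table");
        job.setJarByClass(CopyTableDriver.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // rows fetched per RPC
        scan.setCacheBlocks(false);  // don't pollute the block cache on a full-table scan

        // One map task per source region; CopyMapper (sketched after the list of
        // suspects below) does the buffered writes, so no reducer is needed.
        TableMapReduceUtil.initTableMapperJob(
            "source_table", scan, CopyMapper.class,
            NullWritable.class, NullWritable.class, job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }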
> >
> > First, I should say that this job took a long time but still mostly
> > worked. However, we've built a mechanism to compare the data fetched from
> > each of the two tables and found that some rows (0.02%) are missing from
> > the destination. We've ruled out a few things already:
> >
> > * A functional bug in the job that would have resulted in skipping that
> > 0.02% of the rows.
> > * The possibility that the data didn't exist yet when the migration job
> > initially ran.
> >
> > At a high level, the suspects could be:
> >
> > * The source table splitting could have resulted in some input keys not
> > being read. However, since each HBase input split is defined by a
> > startKey/endKey range, this would not be expected unless there was a bug
> > in there somehow.
> > * The writing/flushing losing a batch. Since we're buffering writes and
> > flush everything in the cleanup of the map tasks, we would expect write
> > failures to cause task failures/retries and therefore not to be a problem
> > in the end (see the sketch after this list). Given that this flush is
> > synchronous and, according to our understanding, completes when the data
> > is in the WAL and memstore, this also seems unlikely unless there's a bug.
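
To illustrate the second bullet, here's a minimal sketch of the buffered-write
pattern (simplified, with a hypothetical destination table name): the mapper
buffers Puts into a BufferedMutator and flushes in cleanup(). A failed flush or
close throws, which fails the task and triggers an MR retry rather than
silently dropping the batch.

    import java.io.IOException;

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.BufferedMutator;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.NullWritable;

    public class CopyMapper extends TableMapper<NullWritable, NullWritable> {
      private Connection connection;
      private BufferedMutator mutator;

      @Override
      protected void setup(Context context) throws IOException {
        connection = ConnectionFactory.createConnection(context.getConfiguration());
        mutator = connection.getBufferedMutator(TableName.valueOf("dest_table")); // placeholder
      }

      @Override
      protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
          throws IOException, InterruptedException {
        // Copy every cell of the source row into an equivalent Put on the destination.
        Put put = new Put(rowKey.copyBytes());
        for (Cell cell : row.rawCells()) {
          put.add(cell);
        }
        mutator.mutate(put); // buffered; flushed when the buffer fills or in cleanup()
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        // flush() and close() throw if any buffered mutation failed, failing the
        // task so that MR retries it instead of losing the batch.
        try {
          mutator.flush();
        } finally {
          mutator.close();
          connection.close();
        }
      }
    }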
> >
> > I should add that we've extracted a sample of 1% of the source rows (doing
> > all of them is really time-consuming because of the size of the data) and
> > found that missing data often appears in clusters of source HBase row
> > keys. This doesn't really help point at a problem with the scan side of
> > things or the write side of things (since a failure in either would result
> > in a similar output), but we thought it was interesting. That said, we do
> > have a few missing keys that aren't clustered. This could be because we've
> > only run the comparison on 1% of the data, or it could be that whatever is
> > causing this can also affect very isolated cases.
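
For reference, a ~1% sample of source row keys could be pulled with something
like this (a sketch under assumptions; the filter choice and table name are
placeholders, not necessarily how the sample was actually extracted):
RandomRowFilter keeps each row with the given probability and KeyOnlyFilter
strips the values so only keys come back. The sampled keys can then be checked
against the destination with point Gets.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
    import org.apache.hadoop.hbase.filter.RandomRowFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SampleSourceKeys {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table source = conn.getTable(TableName.valueOf("source_table"))) { // placeholder
          Scan scan = new Scan();
          scan.setCaching(1000);
          scan.setCacheBlocks(false);
          scan.setFilter(new FilterList(
              new RandomRowFilter(0.01f),   // keep roughly 1% of rows
              new KeyOnlyFilter()));        // return keys only, not full values
          try (ResultScanner scanner = source.getScanner(scan)) {
            for (Result r : scanner) {
              System.out.println(Bytes.toStringBinary(r.getRow()));
            }
          }
        }
      }
    }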
> >
> > We're now trying to understand how this could have happened, both to
> > understand how it could impact other jobs/applications and to increase our
> > confidence before we write a modified version of the migration job to
> > re-migrate the skipped/missing data.
> >
> > Any ideas or advice would be much appreciated.
> >
> > Thanks!
> >
> > --
> > Alex
> >
>
-- 
Alex
