hbase-user mailing list archives

From Alexandre Normand <alexandre.norm...@gmail.com>
Subject Re: Seeking advice on skipped/lost data during data migration from and to a hbase table
Date Sun, 05 Feb 2017 21:10:20 GMT
That's a good suggestion. I'll give that a try.

Thanks again!

On Sun, Feb 5, 2017 at 1:07 PM Ted Yu <yuzhihong@gmail.com> wrote:

> You can run rowcounter on the source tables multiple times.
>
> With region servers under load, you would observe inconsistent results from
> different runs.
>
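Ted's suggestion boils down to: run the same count repeatedly and treat any disagreement as evidence that scans are dropping rows under load. A minimal sketch of that consistency check (plain Python; `counts_consistent` and the sample numbers are hypothetical stand-ins for the row counts reported by repeated RowCounter MapReduce runs):

```python
# Hypothetical sketch: each entry in `counts` stands in for the total row
# count reported by one RowCounter run against the same source table.
def counts_consistent(counts):
    """Return True if every run produced the same row count."""
    return len(set(counts)) <= 1

# Three illustrative runs under load; the third run dropped some rows.
runs = [1_000_000, 1_000_000, 999_874]
print(counts_consistent(runs))  # False -> the scan path is suspect
```

If repeated runs disagree while ingest is active, the problem is on the scan/source side rather than in the migration job's write path.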
> On Sun, Feb 5, 2017 at 12:54 PM, Alexandre Normand <
> alexandre.normand@gmail.com> wrote:
>
> > Thanks, Ted. We're running HBase 1.0.0-cdh5.5.4, which isn't among the
> > fixed versions, so this might be related. It's somewhat reassuring to
> > think this would be missed data on the scan/source side, because that
> > would mean our other ingest/write workloads aren't affected.
> >
> > From reading the JIRA description, it sounds like it would be difficult
> > to confirm that we've been affected by this bug. Am I right?
> >
> > On Sun, Feb 5, 2017 at 12:36 PM Ted Yu <yuzhihong@gmail.com> wrote:
> >
> > > Which release of HBase are you using?
> > >
> > > To be specific, does the release have HBASE-15378 ?
> > >
> > > Cheers
> > >
> > > On Sun, Feb 5, 2017 at 11:32 AM, Alexandre Normand <
> > > alexandre.normand@gmail.com> wrote:
> > >
> > > > We're migrating data from a previous iteration of a table to a new
> > > > one, and this process involved an MR job that scans data from the
> > > > source table and writes the equivalent data into the new table. The
> > > > source table has 6000+ regions and it splits frequently because
> > > > we're still ingesting time series data into it. We used buffered
> > > > writing on the other end when writing to the new table, and we have
> > > > a YARN resource pool to limit concurrent writing.
> > > >
> > > > First, I should say that this job took a long time but still mostly
> > > > worked. However, we've built a mechanism to compare requested data
> > > > fetched from each of the tables, and it found that some rows (0.02%)
> > > > are missing from the destination. We've ruled out a few things
> > > > already:
> > > >
> > > > * A functional bug in the job that would have resulted in skipping
> > > > that 0.02% of the rows.
> > > > * The possibility that the data didn't yet exist when the migration
> > > > job initially ran.
> > > >
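The comparison mechanism described above amounts to a set difference over row keys. A minimal sketch of that check (hypothetical Python; `missing_keys` and the synthetic `row-NNNN` keys are illustrative, not the actual tooling, which would work over exported key lists from both tables):

```python
def missing_keys(source_keys, dest_keys):
    """Row keys present in the source table but absent from the destination."""
    return sorted(set(source_keys) - set(dest_keys))

# Synthetic example: 10,000 source keys, two of which never reached the
# destination table.
source = [f"row-{i:04d}" for i in range(10_000)]
dest = [k for k in source if k not in {"row-0042", "row-0043"}]

missing = missing_keys(source, dest)
print(len(missing) / len(source))  # 0.0002 -> 0.02% missing
```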
> > > > At a high level, the suspects could be:
> > > >
> > > > * The source table splitting could have caused some input keys not
> > > > to be read. However, since an HBase region is defined by a
> > > > startKey/endKey pair, this would not be expected unless there's a
> > > > bug somewhere in that path.
> > > > * The writing/flushing losing a batch. Since we buffer writes and
> > > > flush everything in the cleanup of the map tasks, we would expect
> > > > write failures to cause task failures/retries and therefore not to
> > > > be a problem in the end. Given that this flush is synchronous and,
> > > > according to our understanding, completes once the data is in the
> > > > WAL and memstore, this also seems unlikely unless there's a bug.
> > > >
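The first suspect above rests on an invariant: sorted regions tile the keyspace, so each region's endKey must equal the next region's startKey, with no gaps even after splits. A small sketch of checking that invariant (hypothetical Python; `coverage_gaps` and the sample boundaries are illustrative — real region boundaries would come from the table's region metadata):

```python
def coverage_gaps(regions):
    """Given (start_key, end_key) pairs sorted by start key, return any gaps
    between consecutive regions. An empty string means an open-ended boundary.
    A correct split keeps regions contiguous, so this should return []."""
    gaps = []
    for (_, prev_end), (cur_start, _) in zip(regions, regions[1:]):
        if prev_end != cur_start:
            gaps.append((prev_end, cur_start))
    return gaps

# Illustrative boundaries: keys in ["t", "u") belong to no region at all,
# so a scan partitioned by region would silently skip them.
regions = [("", "m"), ("m", "t"), ("u", "")]
print(coverage_gaps(regions))  # [('t', 'u')]
```

If a snapshot of region boundaries taken mid-split ever showed such a gap, that would explain skipped input ranges; with contiguous boundaries, the split itself shouldn't lose keys.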
> > > > I should add that we've extracted a sample of 1% of the source rows
> > > > (doing all of them is really time-consuming given the size of the
> > > > data) and found that the missing data often appears in clusters of
> > > > source HBase row keys. This doesn't really point to a problem on the
> > > > scan side versus the write side (a failure in either would produce
> > > > similar output), but we thought it was interesting. That said, we do
> > > > have a few missing keys that aren't clustered. That could be because
> > > > we've only run the comparison on 1% of the data, or it could be that
> > > > whatever is causing this can also affect very isolated cases.
> > > >
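The clustering observation above can be made precise by grouping the missing keys' sorted positions into runs. A minimal sketch (hypothetical Python; `cluster_runs`, the positions, and the gap threshold are all illustrative — in practice the positions would be the missing keys' indices in the sorted source key list):

```python
def cluster_runs(positions, max_gap=1):
    """Group sorted numeric positions of missing keys into clusters whose
    neighbours are at most max_gap apart. Many large clusters suggest whole
    batches or scan ranges were lost; singletons suggest isolated failures."""
    clusters = []
    for p in positions:
        if clusters and p - clusters[-1][-1] <= max_gap:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters

print(cluster_runs([3, 4, 5, 20, 21, 90]))  # [[3, 4, 5], [20, 21], [90]]
```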
> > > > We're now trying to understand how this could have happened, both to
> > > > understand how it could impact other jobs/applications and to gain
> > > > confidence before we write a modified version of the migration job
> > > > to re-migrate the skipped/missing data.
> > > >
> > > > Any ideas or advice would be much appreciated.
> > > >
> > > > Thanks!
> > > >
> > > > --
> > > > Alex
> > > >
> > >
> > --
> > Alex
> >
>
-- 
Alex
