hbase-user mailing list archives

From Sean Busbey <bus...@apache.org>
Subject Re: Seeking advice on skipped/lost data during data migration from and to a hbase table
Date Tue, 07 Feb 2017 18:09:23 GMT
HBASE-15378 says that it was caused by HBASE-13090, I think.

That issue is present in CDH5.5.4:

http://archive.cloudera.com/cdh5/cdh/5/hbase-1.0.0-cdh5.5.4.releasenotes.html

(Search in page for HBASE-13090)

On Tue, Feb 7, 2017 at 11:51 AM, Alexandre Normand
<alexandre.normand@gmail.com> wrote:
> Reporting back with some results.
>
> We ran several RowCounters and each one gave us the same count back. It
> could be because RowCounter is much more lightweight than our migration
> job (which reads every cell and writes an equivalent version to another
> table), but it's hard to tell.
>
> Taking a step back, it looks like the bug described in HBASE-15378 was
> introduced in 1.1.0, which wouldn't affect us since we're still on
> 1.0.0-cdh5.5.4.
>
> I guess that puts us back to square one. Any other ideas?
>
> On Sun, Feb 5, 2017 at 1:10 PM Alexandre Normand <
> alexandre.normand@gmail.com> wrote:
>
>> That's a good suggestion. I'll give that a try.
>>
>> Thanks again!
>>
>> On Sun, Feb 5, 2017 at 1:07 PM Ted Yu <yuzhihong@gmail.com> wrote:
>>
>> You can run rowcounter on the source tables multiple times.
>>
>> With region servers under load, you would observe inconsistent results from
>> different runs.
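>>
>> If it helps, here is a rough client-side sketch of repeating such a count
>> (this is not the MapReduce RowCounter itself, and the table name below is
>> only a placeholder):
>>
>>     import org.apache.hadoop.conf.Configuration;
>>     import org.apache.hadoop.hbase.HBaseConfiguration;
>>     import org.apache.hadoop.hbase.TableName;
>>     import org.apache.hadoop.hbase.client.*;
>>     import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
>>
>>     public class RepeatedRowCount {
>>       public static void main(String[] args) throws Exception {
>>         Configuration conf = HBaseConfiguration.create();
>>         try (Connection conn = ConnectionFactory.createConnection(conf);
>>              // "source_table" is a placeholder name
>>              Table table = conn.getTable(TableName.valueOf("source_table"))) {
>>           for (int run = 1; run <= 3; run++) {
>>             Scan scan = new Scan();
>>             scan.setFilter(new FirstKeyOnlyFilter()); // row keys only
>>             scan.setCaching(1000);                    // rows per RPC
>>             long count = 0;
>>             try (ResultScanner scanner = table.getScanner(scan)) {
>>               for (Result r : scanner) {
>>                 count++;
>>               }
>>             }
>>             System.out.println("run " + run + ": " + count + " rows");
>>           }
>>         }
>>       }
>>     }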
>>
>> On Sun, Feb 5, 2017 at 12:54 PM, Alexandre Normand <
>> alexandre.normand@gmail.com> wrote:
>>
>> > Thanks, Ted. We're running HBase 1.0.0-cdh5.5.4, which isn't in the list
>> > of fixed versions, so this might be related. This is somewhat reassuring
>> > to think that this would be missed data on the scan/source side, because
>> > this would mean that our other ingest/write workloads wouldn't be
>> > affected.
>> >
>> > From reading the JIRA description, it sounds like it would be difficult
>> > to confirm that we've been affected by this bug. Am I right?
>> > On Sun, Feb 5, 2017 at 12:36 PM Ted Yu <yuzhihong@gmail.com> wrote:
>> >
>> > > Which release of HBase are you using?
>> > >
>> > > To be specific, does the release have HBASE-15378?
>> > >
>> > > Cheers
>> > >
>> > > On Sun, Feb 5, 2017 at 11:32 AM, Alexandre Normand <
>> > > alexandre.normand@gmail.com> wrote:
>> > >
>> > > > We're migrating data from a previous iteration of a table to a new one,
>> > > > and this process involved an MR job that scans data from the source
>> > > > table and writes the equivalent data to the new table. The source table
>> > > > has 6000+ regions and it frequently splits because we're still ingesting
>> > > > time series data into it. We used buffered writing on the other end when
>> > > > writing to the new table, and we have a YARN resource pool to limit the
>> > > > concurrent writing.
>> > > >
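>> > > > For illustration, a minimal sketch of this kind of copy mapper (the
>> > > > class name and the verbatim cell copy below are simplifications, not
>> > > > our actual job):
>> > > >
>> > > >     import java.io.IOException;
>> > > >     import org.apache.hadoop.hbase.Cell;
>> > > >     import org.apache.hadoop.hbase.client.Put;
>> > > >     import org.apache.hadoop.hbase.client.Result;
>> > > >     import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>> > > >     import org.apache.hadoop.hbase.mapreduce.TableMapper;
>> > > >
>> > > >     // Reads every row of the source table and emits an equivalent Put
>> > > >     // destined for the new table.
>> > > >     public class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
>> > > >       @Override
>> > > >       protected void map(ImmutableBytesWritable key, Result value, Context context)
>> > > >           throws IOException, InterruptedException {
>> > > >         Put put = new Put(value.getRow());
>> > > >         for (Cell cell : value.rawCells()) {
>> > > >           put.add(cell);   // carry each cell over unchanged
>> > > >         }
>> > > >         context.write(key, put);
>> > > >       }
>> > > >     }
>> > > >
>> > > > A job like this would be wired up with TableMapReduceUtil.initTableMapperJob
>> > > > on the scan side and TableMapReduceUtil.initTableReducerJob (no reducer,
>> > > > zero reduce tasks) on the write side, much like the stock CopyTable job.
>> > > >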
>> > > > First, I should say that this job took a long time but still mostly
>> > > > worked. However, we've built a mechanism to compare requested data
>> > > > fetched from each of the tables (a rough sketch of that comparison
>> > > > follows the list below) and found that some rows (0.02%) are missing
>> > > > from the destination. We've ruled out a few things already:
>> > > >
>> > > > * A functional bug in the job that would have resulted in skipping that
>> > > > 0.02% of the rows.
>> > > > * The possibility that the data didn't exist yet when the migration job
>> > > > initially ran.
>> > > >
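>> > > > Roughly, the comparison boils down to something like the following
>> > > > (a simplified sketch rather than our actual code; the table names and
>> > > > the key-range arguments are placeholders):
>> > > >
>> > > >     import java.util.Set;
>> > > >     import java.util.TreeSet;
>> > > >     import org.apache.hadoop.conf.Configuration;
>> > > >     import org.apache.hadoop.hbase.HBaseConfiguration;
>> > > >     import org.apache.hadoop.hbase.TableName;
>> > > >     import org.apache.hadoop.hbase.client.*;
>> > > >     import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
>> > > >     import org.apache.hadoop.hbase.util.Bytes;
>> > > >
>> > > >     public class MissingRowCheck {
>> > > >       // Collect the row keys in [start, stop) from one table.
>> > > >       static Set<String> rowKeys(Connection conn, String table,
>> > > >                                  byte[] start, byte[] stop) throws Exception {
>> > > >         Set<String> keys = new TreeSet<>();
>> > > >         Scan scan = new Scan(start, stop);
>> > > >         scan.setFilter(new FirstKeyOnlyFilter());  // row keys only
>> > > >         try (Table t = conn.getTable(TableName.valueOf(table));
>> > > >              ResultScanner scanner = t.getScanner(scan)) {
>> > > >           for (Result r : scanner) {
>> > > >             keys.add(Bytes.toStringBinary(r.getRow()));
>> > > >           }
>> > > >         }
>> > > >         return keys;
>> > > >       }
>> > > >
>> > > >       public static void main(String[] args) throws Exception {
>> > > >         Configuration conf = HBaseConfiguration.create();
>> > > >         byte[] start = Bytes.toBytes(args[0]);   // sampled key range
>> > > >         byte[] stop = Bytes.toBytes(args[1]);
>> > > >         try (Connection conn = ConnectionFactory.createConnection(conf)) {
>> > > >           // "source_table" and "new_table" are placeholder names
>> > > >           Set<String> source = rowKeys(conn, "source_table", start, stop);
>> > > >           Set<String> dest = rowKeys(conn, "new_table", start, stop);
>> > > >           source.removeAll(dest);                // in source but not in dest
>> > > >           for (String k : source) {
>> > > >             System.out.println("missing: " + k);
>> > > >           }
>> > > >         }
>> > > >       }
>> > > >     }
>> > > >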
>> > > > At a high level, the suspects could be:
>> > > >
>> > > > * The source table splitting could have resulted in some input keys not
>> > > > being read. However, since an HBase split is defined by a
>> > > > startKey/endKey, this would not be expected unless there was a bug in
>> > > > there somehow.
>> > > > * The writing/flushing losing a batch. Since we're buffering writes and
>> > > > flushing everything in the cleanup of the map tasks (see the sketch
>> > > > after this list), we would expect write failures to cause task
>> > > > failures/retries and therefore not to be a problem in the end. Given
>> > > > that the flush is synchronous and, according to our understanding,
>> > > > completes once the data is in the WAL and memstore, this also seems
>> > > > unlikely unless there's a bug.
>> > > >
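>> > > > For clarity, a minimal sketch of that buffered-write pattern (assuming
>> > > > the 1.0 BufferedMutator API; the class and table names are placeholders,
>> > > > not our actual code):
>> > > >
>> > > >     import java.io.IOException;
>> > > >     import org.apache.hadoop.hbase.Cell;
>> > > >     import org.apache.hadoop.hbase.TableName;
>> > > >     import org.apache.hadoop.hbase.client.BufferedMutator;
>> > > >     import org.apache.hadoop.hbase.client.Connection;
>> > > >     import org.apache.hadoop.hbase.client.ConnectionFactory;
>> > > >     import org.apache.hadoop.hbase.client.Put;
>> > > >     import org.apache.hadoop.hbase.client.Result;
>> > > >     import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>> > > >     import org.apache.hadoop.hbase.mapreduce.TableMapper;
>> > > >     import org.apache.hadoop.io.NullWritable;
>> > > >
>> > > >     // Writes directly to the destination table through a client-side
>> > > >     // write buffer instead of emitting Puts to an output format.
>> > > >     public class BufferedCopyMapper extends TableMapper<NullWritable, NullWritable> {
>> > > >       private Connection connection;
>> > > >       private BufferedMutator mutator;
>> > > >
>> > > >       @Override
>> > > >       protected void setup(Context context) throws IOException {
>> > > >         connection = ConnectionFactory.createConnection(context.getConfiguration());
>> > > >         // "new_table" is a placeholder name
>> > > >         mutator = connection.getBufferedMutator(TableName.valueOf("new_table"));
>> > > >       }
>> > > >
>> > > >       @Override
>> > > >       protected void map(ImmutableBytesWritable key, Result value, Context context)
>> > > >           throws IOException, InterruptedException {
>> > > >         Put put = new Put(value.getRow());
>> > > >         for (Cell cell : value.rawCells()) {
>> > > >           put.add(cell);
>> > > >         }
>> > > >         mutator.mutate(put);  // buffered locally; not yet guaranteed durable
>> > > >       }
>> > > >
>> > > >       @Override
>> > > >       protected void cleanup(Context context) throws IOException {
>> > > >         // Synchronous flush: a failure here should fail the task and
>> > > >         // trigger a retry rather than silently drop the buffered writes.
>> > > >         mutator.flush();
>> > > >         mutator.close();
>> > > >         connection.close();
>> > > >       }
>> > > >     }
>> > > >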
>> > > > I should add that we've extracted a sample of 1% of the source rows
>> > > > (doing all of them is really time consuming because of the size of the
>> > > > data) and found that missing data often appears in clusters of the
>> > > > source HBase row keys. This doesn't really help point at a problem with
>> > > > the scan side of things or the write side of things (since a failure in
>> > > > either would result in a similar output), but we thought it was
>> > > > interesting. That said, we do have a few missing keys that aren't
>> > > > clustered. This could be because we've only run the comparison for 1% of
>> > > > the data, or it could be that whatever is causing this can affect very
>> > > > isolated cases.
>> > > >
>> > > > We're now trying to understand how this could have happened, both to
>> > > > understand how it could impact other jobs/applications and to increase
>> > > > our confidence before we write a modified version of the migration job
>> > > > to re-migrate the skipped/missing data.
>> > > >
>> > > > Any ideas or advice would be much appreciated.
>> > > >
>> > > > Thanks!
>> > > >
>> > > > --
>> > > > Alex
>> > > >
>> > >
>> > --
>> > Alex
>> >
>>
>> --
>> Alex
>>
> --
> Alex
