hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Yu <yuzhih...@gmail.com>
Subject Re: Missing data in snapshot - possible flush timing issue?
Date Wed, 24 May 2017 12:46:44 GMT
bq. the snapshot finishes before the flush of that last region finishes

According to the last line in the pastebin, flush finished at 2017-05-12
02:06:06,063
Did you find something in master log which indicated that snapshot finished
before the above time ?

I went thru snapshot bug fixes in branch-1.1 backward until seeing HBASE-15430
which was in 1.1.4
Among more recent snapshot fixes, there was none matching your description.

Next time this happens to your cluster, can you capture master log
(hopefully with DEBUG logging) ? We would get more clue from master log.

Thanks

On Wed, May 24, 2017 at 1:21 AM, LeBlanc, Jacob <jacob.leblanc@hpe.com>
wrote:

> Hi all,
>
> We've recently been testing a Spark job that takes a snapshot of an HBase
> table and then creates an RDD using TableSnapshotInputFormat class, and
> we've run into an interesting issue. It appears that the snapshot, and
> therefore resulting RDD, was missing some data that was recently written to
> the table. After some digging, it looks like the missing data was limited
> to a particular region as if the memstore for the region failed to flush
> before the snapshot was taken.
>
> For our testing the job was scheduled in Oozie to run hourly and we are
> missing data from the region for exactly the last hour, indicating that
> when the job ran at 1:05 and took the snapshot everything was flushed and
> worked fine, but at 2:05 when the job ran we missed everything written to
> the region since the flush at 1:05.
>
> Here is an excerpt from the region server at that time:
>
> https://pastebin.com/1ECXjhRp
>
> Of the three regions for the table owned by this region server, two of
> them have messages from "snapshot-pool81-thread-1" saying it's flushing the
> memstore. But at "2017-05-12 02:06:03,229" before the flush of region with
> rowkey starting with ";uP\xE1" finishes, there is a message from
> "MemStoreFlusher.1" saying it's starting the flush of memstore for region
> with rowkey starting with "\xC0\x8F". Then it appears that the snapshot
> finishes before the flush of that last region finishes, and that is the
> exact region from which we are missing data.
>
> So it kind of looks like some timing issue where a normal scheduled flush
> of the region's memstore kicked in on one thread while the regions were
> being flushed for the snapshot by another thread causing us to miss the
> flushed data for that one region in the snapshot. Out of something like
> 50-100 times this job was run we saw this occur twice.
>
> We are running HBase 1.1.4. Is this a known problem? I've tried looking
> around JIRA but didn't find anything obvious. Anyone familiar with the code
> could see how this might happen?
>
> We can code defensively against this, but wanted to understand what we are
> dealing with.
>
> BTW, I don't have the logs from the master at that time as they've been
> rolled since then.
>
> Thanks!
>
> --Jacob LeBlanc
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message