hbase-user mailing list archives

From "LeBlanc, Jacob" <jacob.lebl...@hpe.com>
Subject RE: Missing data in snapshot - possible flush timing issue?
Date Wed, 24 May 2017 15:08:16 GMT
Thanks for looking Ted!

My understanding of the log messages is that the last line of the pastebin is the end of the
memstore flush for the region where we missed data, but that line is tagged with "[MemStoreFlusher.1]",
whereas the flushes of the other regions done as part of the snapshot are tagged with
"[rs(a1-qa-hbr31416d.lab.lynx-connected.com,16020,1494432106955)-snapshot-pool81-thread-1]".
With only a superficial understanding, it looks like the flush tagged with "[MemStoreFlusher.1]",
while happening at the same time, wasn't really part of the snapshot process. For example, line 3
in the pastebin shows the flush of one region starting, tagged with snapshot-pool81-thread-1; line 4
shows the flush starting for the region where we missed data, tagged with MemStoreFlusher.1; and
line 5 continues with the flush being done as part of the snapshot. So multiple flushes were
definitely occurring at the same time, whereas elsewhere in the logs the flushes always seem to be
done sequentially as part of the snapshot. That led me to the theory that there is a timing issue
where the flushed data for a region is missed by the snapshot because the flush is happening on
another thread as part of the normal, periodic flushing of the memstore.

The last line I see in the full region server log that has anything to do with the snapshot
is line 11 in the pastebin at 2017-05-12 02:06:05,577 where it's processing events from zookeeper.
Again with only a superficial understanding, I was assuming this had something to do with
the master signaling that the snapshot was complete. We'll be sure to capture the master log
next time.

And thanks for also checking JIRA for me. If there is a bug here, it seems we don't have the
option of upgrading to fix it, so we'll have to plan on coding around it for now.
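
For example, one way to code around it would be to force an explicit flush of the table right
before taking the snapshot. A minimal sketch of that idea (the table and snapshot names below are
placeholders, not our actual names):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class FlushThenSnapshot {
        public static void main(String[] args) throws Exception {
            // Placeholder names; substitute the real table and snapshot name.
            TableName table = TableName.valueOf("my_table");
            String snapshotName = "my_table_hourly";

            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                admin.flush(table);                  // ask the servers to flush every region first
                admin.snapshot(snapshotName, table); // then take the snapshot
            }
        }
    }

One thing we'd still have to verify is whether Admin.flush blocks until the flush has completed on
every region server; if it is asynchronous, a sketch like this could still leave a small window.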

Thanks,

--Jacob

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: Wednesday, May 24, 2017 8:47 AM
To: user@hbase.apache.org
Subject: Re: Missing data in snapshot - possible flush timing issue?

bq. the snapshot finishes before the flush of that last region finishes

According to the last line in the pastebin, the flush finished at 2017-05-12 02:06:06,063.
Did you find anything in the master log indicating that the snapshot finished before that time?

I went through the snapshot bug fixes in branch-1.1, working backward until HBASE-15430, which was
in 1.1.4. Among the more recent snapshot fixes, none matched your description.

Next time this happens on your cluster, can you capture the master log (hopefully with DEBUG
logging)? We would get more clues from the master log.
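
For example, bumping the master-side snapshot packages to DEBUG in conf/log4j.properties on the
master should be enough (the package names here are assumed from the 1.x source layout, so
double-check them against your build):

    log4j.logger.org.apache.hadoop.hbase.master=DEBUG
    log4j.logger.org.apache.hadoop.hbase.master.snapshot=DEBUG
    log4j.logger.org.apache.hadoop.hbase.procedure=DEBUG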

Thanks

On Wed, May 24, 2017 at 1:21 AM, LeBlanc, Jacob <jacob.leblanc@hpe.com>
wrote:

> Hi all,
>
> We've recently been testing a Spark job that takes a snapshot of an
> HBase table and then creates an RDD using the TableSnapshotInputFormat
> class, and we've run into an interesting issue. It appears that the
> snapshot, and therefore the resulting RDD, was missing some data that
> had recently been written to the table. After some digging, it looks
> like the missing data was limited to a particular region, as if the
> memstore for the region had failed to flush before the snapshot was taken.
>
> For our testing the job was scheduled in Oozie to run hourly and we 
> are missing data from the region for exactly the last hour, indicating 
> that when the job ran at 1:05 and took the snapshot everything was 
> flushed and worked fine, but at 2:05 when the job ran we missed 
> everything written to the region since the flush at 1:05.
>
> Here is an excerpt from the region server at that time:
>
> https://pastebin.com/1ECXjhRp
>
> Of the three regions for the table owned by this region server, two of
> them have messages from "snapshot-pool81-thread-1" saying it's flushing
> the memstore. But at "2017-05-12 02:06:03,229", before the flush of the
> region with rowkey starting with ";uP\xE1" finishes, there is a message
> from "MemStoreFlusher.1" saying it's starting the flush of the memstore
> for the region with rowkey starting with "\xC0\x8F". Then it appears
> that the snapshot finishes before the flush of that last region
> finishes, and that is the exact region from which we are missing data.
>
> So it kind of looks like a timing issue where a normally scheduled
> flush of the region's memstore kicked in on one thread while the
> regions were being flushed for the snapshot by another thread, causing
> us to miss the flushed data for that one region in the snapshot. Out of
> something like 50-100 runs of this job, we saw this occur twice.
>
> We are running HBase 1.1.4. Is this a known problem? I've tried
> looking around JIRA but didn't find anything obvious. Could anyone
> familiar with the code see how this might happen?
>
> We can code defensively against this, but wanted to understand what we 
> are dealing with.
>
> BTW, I don't have the logs from the master at that time as they've 
> been rolled since then.
>
> Thanks!
>
> --Jacob LeBlanc
>
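
For context, the snapshot-to-RDD setup described in the quoted message is typically wired up
roughly as follows. This is only a sketch: the snapshot name, the restore directory, and the
Spark context handling are placeholders, and the actual job in question may differ.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SnapshotRdd {
        public static JavaPairRDD<ImmutableBytesWritable, Result> fromSnapshot(
                JavaSparkContext sc, String snapshotName) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf);
            // The snapshot is "restored" (as file references, not a copy) into a
            // scratch directory on HDFS for reading; the path below is a placeholder.
            TableSnapshotInputFormat.setInput(job, snapshotName,
                    new Path("/tmp/snapshot_restore"));
            // Build the RDD over the restored snapshot files.
            return sc.newAPIHadoopRDD(job.getConfiguration(),
                    TableSnapshotInputFormat.class,
                    ImmutableBytesWritable.class,
                    Result.class);
        }
    }

Reading a snapshot this way goes straight to the snapshot's HFiles and bypasses the region servers,
so any data still sitting only in a memstore at snapshot time would not show up in the RDD.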