hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-16820) BulkLoad mvcc visibility only works accidentally
Date Thu, 13 Oct 2016 17:08:21 GMT

    [ https://issues.apache.org/jira/browse/HBASE-16820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572549#comment-15572549

stack commented on HBASE-16820:

Thanks for the reasoning. Interesting from your recounting is the fact that flushing first
is optional (I'd thought this non-optional) and the bit where when we unified sequenceid and
mvcc, we missed this special-case. The patch looks like it will 'work' but I like the suggestion
that bulk load get its own sequenceid that is apart from that of the flush; if I remember
from the original implementation, we used the flush sequenceid because that was guaranteed
'safe'; going beyond this id there were concerns some other edit could sneak in between the
flush and bulk load. I think now though we have mechanism to get a sequenceid with guaranteeds
that it immediately follows the flush id with no edits able to sneak in behind. Nice work

> BulkLoad mvcc visibility only works accidentally 
> -------------------------------------------------
>                 Key: HBASE-16820
>                 URL: https://issues.apache.org/jira/browse/HBASE-16820
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>         Attachments: HBASE-16820-branch-1.1-v0.patch
> [~sergey.soldatov] has been debugging an issue with a 1.1 code base where the commit
for HBASE-16721 broke the bulk load visibility. After bulk load, the bulk load files is not
visible because the sequence id assigned to the bulk load is not advanced in mvcc. 
> Debugging further, we have noticed that bulk load behavior is wrong, but it works "accidentally"
in all code bases (but broken in 1.1 after HBASE-16721). Let me explain: 
>  - BL request can optionally request a flush before hand (this should be the default)
which causes the flush to happen with some sequenceId. The flush sequence id is one past all
the cells' sequenceids. This flush sequence id is returned as a result to the flush operation.

>  - BL then uses this particular sequenceId to mark the files, but itself does not get
a new sequenceid of its own, or advance the mvcc number. 
>  - BL completes WITHOUT making sure that the sequence id is visible. 
>  - BL itself though writes entries to the WAL for the BL event, which in 1.2 code bases
goes through the whole mvcc + seqId paths, which makes sure that earlier sequenceIds (the
flush sequenceId) are visible via mvcc. 
> The problem with 1.1 is that the WAL entries only get sequence ids, but do not touch
mvcc. With the patch for HBASE-16721, we have made it so that the flushedSequenceId is not
used in mvcc as the highest read point (although all the data is still visible).
> BL relying on the flush sequence id is wrong for two reasons: 
>  - BL files are loaded with the flush sequence id from the memstore. This particular
sequence id is used twice for two different things and ends up being the sequence id for flushed
file as well as BL'ed files. 
>  - BL should make sure that it gets a new sequence id and that sequence id is visible
before returning the results. 
> [~ndimiduk] FYI. 

This message was sent by Atlassian JIRA

View raw message