hbase-issues mailing list archives

From "Phil Yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-15968) MVCC-sensitive semantics of versions
Date Tue, 13 Sep 2016 17:00:26 GMT

    [ https://issues.apache.org/jira/browse/HBASE-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15487753#comment-15487753 ]

Phil Yang commented on HBASE-15968:

[~stack] Thanks for your reply. This patch has some bugs. I have fixed them locally but have not
yet uploaded a new patch because it does not support visibility labels yet. If needed, I can
upload a new patch with the fixes (although visibility labels are not done) and post it to
Review Board to help you review.

This 'fixed' behavior should be default in 2.0.0
One concern is performance. With the current "broken" behavior, we read cells in descending
timestamp order, and each cell costs O(1) time. With the new behavior, we have to save
some info from all cells whose ts is higher than this cell's (and from all cells, for family-delete
markers) and check whether this cell is visible according to their ts and mvcc, which is not O(1).
I am not entirely sure, but the complexity may be O(N*logM), where N is the number of delete
markers whose ts is higher but whose mvcc is lower, and M is the max versions in the cf's conf.
I implemented the data structure with the current design because I think N will not be very high
even if we have many Puts and Deletes: in most cases we will not have a Cell with a higher mvcc
but a lower timestamp, and M is usually only 1, 2, 3 or some other small number.

Yeah, it's an outstanding question as to when it is safe to set sequenceid/mvcc == 0.
This is for new tables only?
In the patch I disable this feature, so we always save mvcc. If we alter an existing table to
the new behavior, we have to handle Cells whose mvcc is stored in the HFile's header. Many Cells
will then have the same mvcc, which is not a very difficult issue, but we need to prove there is
no bug in this situation. We also have to define the order of Cells with the same mvcc, just as
we define the order of Type.

mvcc-sensitive is not a good name because the whole system is already mvcc sensitive.
To be honest, I spent some time on naming this issue, but I have no idea what the best name is....
 Just calling it "fix the bug" is very exciting for me :)

> MVCC-sensitive semantics of versions
> ------------------------------------
>                 Key: HBASE-15968
>                 URL: https://issues.apache.org/jira/browse/HBASE-15968
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Phil Yang
>            Assignee: Phil Yang
>         Attachments: HBASE-15968-v1.patch
> In the HBase book, the Versions chapter has a section called "Current Limitations"; see http://hbase.apache.org/book.html#_current_limitations
> {quote}
> 28.3. Current Limitations
> 28.3.1. Deletes mask Puts
> Deletes mask puts, even puts that happened after the delete was entered. See HBASE-2256.
Remember that a delete writes a tombstone, which only disappears after the next major compaction
has run. Suppose you do a delete of everything <= T. After this you do a new put with a timestamp
<= T. This put, even if it happened after the delete, will be masked by the delete tombstone.
Performing the put will not fail, but when you do a get you will notice the put had no
effect. It will start working again after the major compaction has run. These issues should
not be a problem if you use always-increasing versions for new puts to a row. But they can
occur even if you do not care about time: just do delete and put immediately after each other,
and there is some chance they happen within the same millisecond.
> 28.3.2. Major compactions change query results
> …create three cell versions at t1, t2 and t3, with a maximum-versions setting
of 2. So when getting all versions, only the values at t2 and t3 will be returned. But if you
delete the version at t2 or t3, the one at t1 will appear again. Obviously, once a major compaction
has run, such behavior will not be the case anymore… (See Garbage Collection in Bending
time in HBase.)
> {quote}
> These limitations result from the current implementation of multi-versioning: we consider only
the timestamp, no matter when a cell arrives, and we do not remove an old version immediately
even when there are enough newer versions.
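
A toy model may make the two limitations concrete. The sketch below is not HBase code: `read_current`, the `(kind, ts)` operation tuples and the kind names are hypothetical, and it only models a read before any major compaction, where visibility is decided by timestamps alone and arrival order is ignored.

```python
# Toy model of the current, timestamp-only read path (before any major compaction).
# Operations are (kind, ts) tuples; their arrival order is deliberately ignored.

def read_current(ops, max_versions):
    """Return visible timestamps, newest first, using timestamps alone."""
    puts = {ts for kind, ts in ops if kind == "put"}
    upto = [ts for kind, ts in ops if kind == "delete_upto"]    # like Delete.addColumns
    exact = {ts for kind, ts in ops if kind == "delete_exact"}  # like Delete.addColumn
    masked_below = max(upto, default=None)
    visible = [
        ts for ts in sorted(puts, reverse=True)
        if ts not in exact and (masked_below is None or ts > masked_below)
    ]
    # Deleted versions do not count toward the version limit.
    return visible[:max_versions]


# Limitation 1: the put at t=1 arrives after the delete but is still masked.
late_put = read_current([("delete_upto", 3), ("put", 1)], max_versions=1)

# Limitation 2: with VERSIONS=2, deleting t=3 makes t=1 visible again.
three_puts = [("put", 1), ("put", 2), ("put", 3)]
before_delete = read_current(three_puts, max_versions=2)
after_delete = read_current(three_puts + [("delete_exact", 3)], max_versions=2)

print(late_put, before_delete, after_delete)  # [] [3, 2] [2, 1]
```

After a major compaction physically removes t1, the last read would return only t2, which is why the query result changes.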
> So we can get stronger version semantics with two guarantees:
> 1. A Delete will not mask a Put that comes after it.
> 2. If a version is masked by enough higher versions (VERSIONS in the cf's conf),
it will never be seen again.
> Some examples to aid understanding:
> (delete t<=3 means using Delete.addColumns to delete all versions whose ts is not greater
than 3, and delete t3 means using Delete.addColumn to delete the version whose ts=3)
> case 1: put t2 -> put t3 -> delete t<=3 -> put t1, and we will get t1 because
the put is after delete.
> case 2: maxversion=2, put t1 -> put t2 -> put t3 -> delete t3, and we will always
get t2 whether or not a major compaction has run, because t1 is masked when we put t3, so t1
will never be seen.
> case 3: maxversion=2, put t1 -> put t2 -> put t3 -> delete t2 -> delete t3,
and we will get nothing.
> case 4: maxversion=3, put t1 -> put t2 -> put t3 -> delete t2 -> delete t3,
and we will get t1 because it is not masked.
> case 5: maxversion=2, put t1 -> put t2 -> put t3 -> delete t3 -> put t1,
and we can get t2+t1 because t3 has been deleted, and when we put t1 the second time it is the
2nd latest version and it can be read.
> case 6: maxversion=2, put t3 -> put t2 -> put t1, and we will get t3+t2, just like what
we get now; ts is still the key of versions.
> Different VERSIONS settings may give different results even when the size of the result is
smaller than VERSIONS (see cases 3 and 4). So Get/Scan.setMaxVersions will be handled at the end,
after we have read the correct data according to the CF's VERSIONS setting.
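
The proposed semantics can be sketched as a small state machine that applies operations in arrival (mvcc) order. This is a minimal model under stated assumptions, not the patch's implementation, which has to decide visibility at read time from sorted cells: `read_new`, `cf_versions`, `limit` and the `(kind, ts)` tuples are hypothetical names, and case 1 does not state a VERSIONS value, so 3 is assumed. In this model case 5 comes out as t2 and t1, because t3 has already been deleted when t1 is written the second time.

```python
# Toy model of the proposed semantics: apply operations in arrival (mvcc) order.

def read_new(ops, cf_versions, limit=None):
    """Return visible timestamps, newest first, under the two guarantees."""
    live = set()  # timestamps of versions that can still be read
    for kind, ts in ops:
        if kind == "put":
            live.add(ts)
            # Guarantee 2: a version below cf_versions newer ones is dropped for good.
            live = set(sorted(live, reverse=True)[:cf_versions])
        elif kind == "delete_upto":    # like Delete.addColumns
            # Guarantee 1: only versions written before the delete are affected.
            live = {t for t in live if t > ts}
        elif kind == "delete_exact":   # like Delete.addColumn
            live.discard(ts)
        else:
            raise ValueError(f"unknown operation: {kind}")
    result = sorted(live, reverse=True)
    # A per-request limit (as with Get/Scan.setMaxVersions) is applied last.
    return result if limit is None else result[:limit]


def puts(*timestamps):
    return [("put", ts) for ts in timestamps]


cases = {
    1: read_new(puts(2, 3) + [("delete_upto", 3)] + puts(1), cf_versions=3),
    2: read_new(puts(1, 2, 3) + [("delete_exact", 3)], cf_versions=2),
    3: read_new(puts(1, 2, 3) + [("delete_exact", 2), ("delete_exact", 3)], cf_versions=2),
    4: read_new(puts(1, 2, 3) + [("delete_exact", 2), ("delete_exact", 3)], cf_versions=3),
    5: read_new(puts(1, 2, 3) + [("delete_exact", 3)] + puts(1), cf_versions=2),
    6: read_new(puts(3, 2, 1), cf_versions=2),
}
for number, visible in cases.items():
    print(f"case {number}: {visible}")
```

Cases 3 and 4 differ only in `cf_versions`, which is why the per-request `limit` has to be applied after the CF-level rule rather than instead of it.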
> The semantics differ from those of current HBase, and supporting them may need more logic,
so the feature is configurable and disabled by default.
