hbase-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-8721) Deletes can mask puts that happen after the delete
Date Tue, 18 Jun 2013 21:32:22 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13687222#comment-13687222 ]

Sergey Shelukhin edited comment on HBASE-8721 at 6/18/13 9:31 PM:
------------------------------------------------------------------

Sorry for the delay, I was out.

I think we agree that the influence of major compaction is a glitch. Technically you don't have
to store all delete markers to prevent it, only the latest for each key.
We can add a timeout for when delete markers are dropped; if that is set to a high enough value,
then you'd drop them only after a long time.
It's similar to the TTL for deleted cells now, which allows you to do point-in-time scans on old
data after it is deleted, but not forever, only for approx. the column family TTL.

For semantics:
{quote}2. The behaviour "delete can mask puts that happened after the delete" is unacceptable
for many users. When a user puts a kv to HBase, his intention is to ADD that kv to HBase and
definitely he want to be able to retrieve that kv back using a Get/Scan operation without
regard to whether or not there is a delete ever occurred. Why current behaviour is unacceptable
for two reasons: a> When a user puts a kv, receives success response, and fails to read
it out, he'll be confused why and it's hard for him to realize that the reason is someone
or himself ever wrote a delete before; b> If delete can mask puts happened after that delete,
this means once a delete is written to HBase(till it's collected by major compact), it can
block that kv be added back to HBase again forever(by semantic) even though that kv can be
added back to HBase successfully using 'put' operation(by syntactic){quote}
But do you agree with this behavior for puts?
If I put row1,cf:c,ts=5,foo, and before that someone put row1,cf:c,ts=10,bar, then when I read
I will get "bar", not "foo".
It's just the same with deletes. The one difference is that deletes always kill puts
with the same ts.
I could see conflicts between deletes and puts with the exact same ts being resolved by arrival
time instead; that would make sense.
But if the TS is different, the semantics should hold, and the glitches should be fixed separately :)
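
To make the analogy concrete, here is a toy model of a single cell (row1, cf:c). It is not HBase code: the class `TimestampOrderDemo` and its methods are made up for illustration, and the delete is assumed to cover every version with a timestamp at or below the marker. It only shows that visibility follows the timestamp, not the order of arrival, for puts and deletes alike.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of one cell; names are hypothetical, this is not the HBase API. */
class TimestampOrderDemo {
    // Versions keyed by timestamp; the highest timestamp is the "latest" version.
    private final TreeMap<Long, String> versions = new TreeMap<>();
    // Highest timestamp covered by a delete marker, or Long.MIN_VALUE if none.
    private long deleteUpTo = Long.MIN_VALUE;

    void put(long ts, String value) {
        versions.put(ts, value);
    }

    /** Assumed semantics: masks every version with timestamp <= ts. */
    void delete(long ts) {
        deleteUpTo = Math.max(deleteUpTo, ts);
    }

    /** Returns the newest version not covered by the delete marker, or null. */
    String get() {
        Map.Entry<Long, String> latest = versions.lastEntry();
        if (latest == null || latest.getKey() <= deleteUpTo) {
            return null;
        }
        return latest.getValue();
    }

    public static void main(String[] args) {
        // A put with ts=5 that arrives after a put with ts=10 is not what a read returns.
        TimestampOrderDemo puts = new TimestampOrderDemo();
        puts.put(10, "bar");
        puts.put(5, "foo");
        System.out.println(puts.get()); // bar

        // Same rule for deletes: a put with ts=5 that arrives after a delete at ts=10 is masked.
        TimestampOrderDemo deleted = new TimestampOrderDemo();
        deleted.delete(10);
        deleted.put(5, "foo");
        System.out.println(deleted.get()); // null
    }
}
```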

{quote}3. Yes, my fix is really to adjust the behaviour "delete can mask puts that happened
after the delete" to the one that "delete can only mask puts that happened before(or equal)
the delete". With this behaviour adjustment the inconsistency caused by major compact doesn't
appear again{quote}
It can still appear, right? If I put into the memstore with ts=2 while a delete record with
ts=3 is there, the delete record will hide the put; but if a major compaction happens, the delete
record disappears.
Also, if I put ts=2 after the major compaction it will be visible, which is also inconsistent,
so one still needs to keep the latest marker forever to avoid that.
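
The scenario above can be sketched with a toy model, under the assumption that a major compaction drops the delete marker together with the versions it masks. The class `CompactionDemo` and its methods are hypothetical and only illustrate the ordering problem; they are not HBase internals.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of one cell plus its latest delete marker; not HBase code. */
class CompactionDemo {
    private final TreeMap<Long, String> versions = new TreeMap<>();
    private Long deleteMarkerTs = null; // null means no marker is stored

    void put(long ts, String value) {
        versions.put(ts, value);
    }

    void delete(long ts) {
        deleteMarkerTs = (deleteMarkerTs == null) ? ts : Math.max(deleteMarkerTs, ts);
    }

    /** Assumption: compaction purges masked versions and then the marker itself. */
    void majorCompact() {
        if (deleteMarkerTs != null) {
            versions.headMap(deleteMarkerTs, true).clear();
            deleteMarkerTs = null;
        }
    }

    String get() {
        Map.Entry<Long, String> latest = versions.lastEntry();
        if (latest == null) {
            return null;
        }
        if (deleteMarkerTs != null && latest.getKey() <= deleteMarkerTs) {
            return null;
        }
        return latest.getValue();
    }

    public static void main(String[] args) {
        // Put arrives while the marker (ts=3) is still stored: hidden.
        CompactionDemo before = new CompactionDemo();
        before.delete(3);
        before.put(2, "v");
        System.out.println(before.get()); // null

        // Same client operations, but a major compaction ran in between: visible.
        CompactionDemo after = new CompactionDemo();
        after.delete(3);
        after.majorCompact();
        after.put(2, "v");
        System.out.println(after.get()); // v
    }
}
```

The two runs issue the same delete and the same put; only the compaction timing differs, which is why the latest marker would have to be kept to get a stable answer.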

{quote}
Actually if we set explicit timestamp, the timestamp can't be the 'current' time when the
put hit RS, so this timestamp can seldom has 'time' semantic in this sense since it's inaccurate
for time ordering. so "If you are using timestamp otherwise for some convenience, you are
misusing it" almost equals to "setting explicit timestamps is misusing it"?
{quote}
Not really; the appropriate use of a timestamp is any kind of versioning that overrides
time-based versioning.
For example, if you are doing batch processing of some data, or loading logs etc., you could
set the source time as the timestamp, instead of the load time.
Or maybe the data has some incremental event IDs that the source creates by means other than time;
you could use those.
Or the ts could come from some external time oracle used for transactions or whatever.

{quote}
when Get/Scan, by timestamp=0/-1 we know this delete is to delete the latest version and check
the kv it sees. And we know the first kv with mvcc < 'mvcc of this delete' is the 'latest'
version when the delete enters RS. 
{quote}
This makes the semantics inconsistent. Versioning for puts still uses timestamps, but for
deletes MVCC is used, and the delete with the latest MVCC might delete some put that is not the latest by TS.

In summary, as far as I can see, the problem can be resolved as follows.
{quote}
2). Performance is poor for deleting a version (rather than all versions of that cell): All
delete for version need to read the timestamp before deleting, the deleteColumn() without
timestamp for deleting the latest version also need to read the latest timestamp in RS, though
transparent to the client
{quote}
1) Have an API to delete a specific version, and also one to delete the latest version (by ts); the
latter will find the latest timestamp inside the RS, just like increment/append/checkAndPut working
on existing data.
2) Make sure delete markers and puts with the exact same timestamp are resolved by mvcc or seqNum
instead of the delete always winning.
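
One possible reading of point 2, as a toy sketch rather than actual region server code: the class `SameTimestampDemo`, the `Cell` type, and the `seqNum` field are hypothetical stand-ins for whatever ordering (mvcc or seqNum) the server would use. Different timestamps keep today's rule; only an exact tie falls back to arrival order.

```java
/** Toy comparison of the current and the proposed tie-breaking rule; not HBase code. */
class SameTimestampDemo {
    /** Hypothetical cell: a timestamp plus a server-assigned arrival sequence number. */
    static final class Cell {
        final long ts;
        final long seqNum;

        Cell(long ts, long seqNum) {
            this.ts = ts;
            this.seqNum = seqNum;
        }
    }

    /** Current behavior as described above: a delete kills puts with the same or lower ts. */
    static boolean maskedToday(Cell put, Cell delete) {
        return delete.ts >= put.ts;
    }

    /** Proposed behavior: on an exact ts tie, the later arrival wins. */
    static boolean maskedProposed(Cell put, Cell delete) {
        if (delete.ts != put.ts) {
            return delete.ts > put.ts; // unchanged timestamp semantics
        }
        return delete.seqNum > put.seqNum; // tie resolved by arrival order
    }

    public static void main(String[] args) {
        Cell delete = new Cell(100, 1); // delete arrives first
        Cell put = new Cell(100, 2);    // put arrives later, same millisecond

        System.out.println(maskedToday(put, delete));    // true
        System.out.println(maskedProposed(put, delete)); // false
    }
}
```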

The major compaction issue is mostly orthogonal to that and could be solved by a TTL for keeping
delete markers (latest per row).

                
                  
> Deletes can mask puts that happen after the delete
> --------------------------------------------------
>
>                 Key: HBASE-8721
>                 URL: https://issues.apache.org/jira/browse/HBASE-8721
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Feng Honghua
>         Attachments: HBASE-8721-0.94-V0.patch
>
>
> this fix aims for bug mentioned in http://hbase.apache.org/book.html 5.8.2.1:
> "Deletes mask puts, even puts that happened after the delete was entered. Remember that
> a delete writes a tombstone, which only disappears after then next major compaction has run.
> Suppose you do a delete of everything <= T. After this you do a new put with a timestamp
> <= T. This put, even if it happened after the delete, will be masked by the delete tombstone.
> Performing the put will not fail, but when you do a get you will notice the put did have no
> effect. It will start working again after the major compaction has run. These issues should
> not be a problem if you use always-increasing versions for new puts to a row. But they can
> occur even if you do not care about time: just do delete and put immediately after each other,
> and there is some chance they happen within the same millisecond."

