hbase-user mailing list archives

From lars hofhansl <la...@apache.org>
Subject Re: Why does a delete behave like this?
Date Mon, 09 Dec 2013 22:53:50 GMT
This is because, by default, a delete marker extends all the way back in time.
When you set KEEP_DELETED_CELLS for your column family, this behavior is fixed, i.e. you get
correct time-range query behavior even with respect to deletes.
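
A minimal sketch of enabling this from the HBase shell, assuming the table 't1' and column family 'cf' from the example quoted below. The disable/enable pair is an assumption for clusters where online schema changes are not enabled, and the result also assumes that no major compaction has already dropped the deleted cells:

```
# Take the table offline for the schema change (may be unnecessary
# if online schema updates are enabled on your cluster).
disable 't1'

# Keep deleted cells for column family 'cf' so that time-range
# queries ending before a delete marker can still see them.
alter 't1', NAME => 'cf', KEEP_DELETED_CELLS => true

enable 't1'

# Re-run the time-range query from the original message; the range
# ends before the delete marker at timestamp 4000.
get 't1', 'rowid', {TIMERANGE => [0, 3500]}
```

With the setting in place, the last get should include cf:1 again, since the queried range ends before the delete marker.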

-- Lars

From: Niels Basjes <Niels@basjes.nl>
To: user <user@hbase.apache.org> 
Sent: Monday, December 9, 2013 12:47 AM
Subject: Why does a delete behave like this?


When I first started learning about HBase, I compared the logic of setting
new values to the way a tool like Subversion works: when you set a new
value you don't overwrite the old one, you simply create a new version.
Just as with Subversion, you can then retrieve the old value at a later
moment, and that way see the situation as it was at an earlier date.

(The only real variation to the SVN model is that HBase only retains the
last N versions of a cell.)

There is, however, one situation where this comparison really fails: when
you do a delete on a cell.
If you want to retrieve the state of a thing from Subversion, and in the
current version this thing has been deleted, then you can still get it back.
With HBase, however, deleting a cell places a tombstone at a specific
time, so internally the older values are still present.

But when you try to retrieve such an older value, you still get an
empty result back (i.e. no such cell).
The direct consequence of the currently implemented model is that an
application can never retrieve the correct state of a row at an older
timestamp once a delete has occurred on any cell in that row.


I create a table with one row:

> create 't1', 'cf'
> put 't1', 'rowid', 'cf:1', 'One', 1000
> put 't1', 'rowid', 'cf:2', 'Two', 2000
> put 't1', 'rowid', 'cf:3', 'Three', 3000
> get 't1', 'rowid' , {TIMERANGE => [0,3500]}

    COLUMN                     CELL
     cf:1                      timestamp=1000, value=One
     cf:2                      timestamp=2000, value=Two
     cf:3                      timestamp=3000, value=Three
    3 row(s) in 0.0150 seconds

Then I delete a cell at a later timestamp:

> delete 't1', 'rowid', 'cf:1', 4000

Now if I retrieve the row as of time 3500, I would expect to still see the
same values as above.
This, however, is what I actually get:

> get 't1', 'rowid' , {TIMERANGE => [0,3500]}

    COLUMN                     CELL
     cf:2                      timestamp=2000, value=Two
     cf:3                      timestamp=3000, value=Three
    2 row(s) in 0.0120 seconds
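
The tombstone and the older value it masks can both be listed with a raw scan. This is a minimal sketch, assuming the table 't1' from above and that no major compaction has run since the delete; the VERSIONS value of 10 is an arbitrary choice, large enough to cover the cells in this example:

```
# RAW => true returns delete markers and the cells they mask,
# instead of applying the markers to the result.
scan 't1', {RAW => true, VERSIONS => 10}
```

This shows that the value at timestamp 1000 is still stored; it is only hidden from normal gets and scans by the marker at timestamp 4000.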

Why has it been designed/implemented like this?
What is the logic behind this model?

Best regards / Met vriendelijke groeten,

Niels Basjes