hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Why does a delete behave like this?
Date Tue, 10 Dec 2013 04:16:59 GMT
I ran the following shell command to create the table:
hbase(main):001:0> create 't1', {NAME => 'cf', KEEP_DELETED_CELLS => true}

After repeating the puts and the delete from the example quoted below, the
second get command returns the same three cells as the first.
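
To make the difference concrete, here is a toy Python model of the behavior discussed in this thread. It is not HBase code: the class `ToyColumnFamily`, its methods, and the helper `replay` are made up for illustration. The masking rule is an assumption based on Lars's explanation quoted below: a column delete marker hides every version at or below its timestamp, and with KEEP_DELETED_CELLS a marker that is newer than the queried time range is ignored. The time range is treated as [min, max), as in the shell's TIMERANGE option.

```python
from dataclasses import dataclass, field


@dataclass
class ToyColumnFamily:
    """Toy in-memory model of one row in one column family (not HBase code)."""
    keep_deleted_cells: bool = False
    puts: list = field(default_factory=list)     # (qualifier, timestamp, value)
    deletes: list = field(default_factory=list)  # (qualifier, timestamp) markers

    def put(self, qualifier, value, ts):
        self.puts.append((qualifier, ts, value))

    def delete(self, qualifier, ts):
        # Column delete marker: masks all versions of the qualifier with
        # timestamp <= ts. The old values themselves are not removed.
        self.deletes.append((qualifier, ts))

    def _masked(self, qualifier, ts, max_ts):
        for del_qualifier, del_ts in self.deletes:
            if del_qualifier != qualifier or ts > del_ts:
                continue
            # Assumed rule: with KEEP_DELETED_CELLS, a marker placed at or
            # after the end of the queried range does not apply.
            if self.keep_deleted_cells and del_ts >= max_ts:
                continue
            return True
        return False

    def get(self, min_ts, max_ts):
        """Newest visible version per qualifier within [min_ts, max_ts)."""
        result = {}
        for qualifier, ts, value in self.puts:
            if not (min_ts <= ts < max_ts):
                continue
            if self._masked(qualifier, ts, max_ts):
                continue
            if qualifier not in result or ts > result[qualifier][0]:
                result[qualifier] = (ts, value)
        return result


def replay(keep_deleted_cells):
    """Replay the puts, the get, the delete and the second get from the thread."""
    cf = ToyColumnFamily(keep_deleted_cells=keep_deleted_cells)
    cf.put("cf:1", "One", 1000)
    cf.put("cf:2", "Two", 2000)
    cf.put("cf:3", "Three", 3000)
    before = cf.get(0, 3500)
    cf.delete("cf:1", 4000)
    after = cf.get(0, 3500)
    return before, after
```

In this model, `replay(False)` loses `cf:1` in the second get, matching the output in the original question, while `replay(True)` returns the same cells both times. A query whose range extends past the marker's timestamp hides `cf:1` in both cases.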

Lars:
The reference guide doesn't cover this usage. Do you think we should document it?

Cheers


On Mon, Dec 9, 2013 at 2:53 PM, lars hofhansl <larsh@apache.org> wrote:

> This is because by default a delete marker extends all the way back in time.
> When you set KEEP_DELETED_CELLS for your column family this behavior is
> fixed, i.e. you get correct time-range query behavior even w.r.t. deletes.
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Niels Basjes <Niels@basjes.nl>
> To: user <user@hbase.apache.org>
> Sent: Monday, December 9, 2013 12:47 AM
> Subject: Why does a delete behave like this?
>
>
> Hi,
>
> When I first started learning about HBase I compared the logic of setting
> new values to something that is similar to the way a tool like Subversion
> works: When you set a new value you don't overwrite the old one, you simply
> create a new version.
> Just like Subversion, you can then at a later moment retrieve the old value
> and that way reconstruct the situation at an earlier date.
>
> (The only real variation to the SVN model is that HBase only retains the
> last N versions of a cell.)
>
> There is however one situation where this comparison really fails: When you
> do a delete on a cell.
> If you want to retrieve the state of a thing from Subversion and that thing
> has been deleted in the current version, you can still get it back.
> With HBase, however, deleting a cell places a tombstone at a specific time,
> so internally the older values are still present.
>
> But when you try to retrieve such an older value then you still get an
> empty result back (i.e. no such cell).
> The direct consequence of the currently implemented model is that an
> application can never retrieve the correct state of a row at an older
> timestamp if a delete on any cell has occurred.
>
> Example:
>
> I create a table with one row:
>
> > create 't1', 'cf'
> > put 't1', 'rowid', 'cf:1', 'One', 1000
> > put 't1', 'rowid', 'cf:2', 'Two', 2000
> > put 't1', 'rowid', 'cf:3', 'Three', 3000
> > get 't1', 'rowid' , {TIMERANGE => [0,3500]}
>
>     COLUMN                     CELL
>      cf:1                      timestamp=1000, value=One
>      cf:2                      timestamp=2000, value=Two
>      cf:3                      timestamp=3000, value=Three
>     3 row(s) in 0.0150 seconds
>
> Then the delete of a cell at a later timestamp:
>
> > delete 't1', 'rowid', 'cf:1', 4000
>
> Now if I retrieve the row at time 3500, I would expect to still see the same
> values as above.
> This, however, is what I actually get:
>
> > get 't1', 'rowid' , {TIMERANGE => [0,3500]}
>
>     COLUMN                     CELL
>      cf:2                      timestamp=2000, value=Two
>      cf:3                      timestamp=3000, value=Three
>     2 row(s) in 0.0120 seconds
>
>
> Why has it been designed/implemented like this?
> What is the logic behind this model?
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>
