hbase-user mailing list archives

From lars hofhansl <la...@apache.org>
Subject Re: Using HBase timestamps as natural versioning
Date Sun, 11 Aug 2013 05:21:22 GMT
If you want deletes to work correctly, you should enable KEEP_DELETED_CELLS for your column
families (I still think that should be the default anyway).
Otherwise time-range queries will not be correct w.r.t. deleted data (specifically, you cannot
get back at deleted data even if you specify a time range before the delete and even if your
column family has unlimited versions).
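
To make the point about time ranges concrete, here is a toy model in plain Java. It is not HBase code and does not use the HBase client API; the class `DeleteVisibilityModel` and its methods are hypothetical names invented for this sketch. It assumes a delete marker at timestamp T masks every put with a timestamp at or below T, and that with KEEP_DELETED_CELLS enabled a marker only applies when it falls inside the queried time range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of one column (NOT HBase code): puts, delete markers, time-range reads.
final class DeleteVisibilityModel {
    private final TreeMap<Long, String> puts = new TreeMap<>();
    private final List<Long> deleteMarkers = new ArrayList<>();
    private final boolean keepDeletedCells;

    DeleteVisibilityModel(boolean keepDeletedCells) {
        this.keepDeletedCells = keepDeletedCells;
    }

    void put(long ts, String value) {
        puts.put(ts, value);
    }

    // Records a marker that masks all puts with a timestamp <= ts.
    void delete(long ts) {
        deleteMarkers.add(ts);
    }

    // Returns the visible values with minTs <= timestamp < maxTs, newest first.
    List<String> read(long minTs, long maxTs) {
        List<String> visible = new ArrayList<>();
        for (Map.Entry<Long, String> e
                : puts.subMap(minTs, true, maxTs, false).descendingMap().entrySet()) {
            if (!isMasked(e.getKey(), maxTs)) {
                visible.add(e.getValue());
            }
        }
        return visible;
    }

    private boolean isMasked(long putTs, long maxTs) {
        for (long marker : deleteMarkers) {
            boolean covers = marker >= putTs;
            // Assumption of this model: without keepDeletedCells a marker always
            // applies; with it, only markers inside the time range apply.
            boolean inScope = !keepDeletedCells || marker < maxTs;
            if (covers && inScope) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (boolean keep : new boolean[] {false, true}) {
            DeleteVisibilityModel column = new DeleteVisibilityModel(keep);
            column.put(10, "a");
            column.put(20, "b");
            column.delete(30);
            // Time range [0, 25) ends before the delete at 30.
            System.out.println("keepDeletedCells=" + keep + " -> " + column.read(0, 25));
        }
    }
}
```

In this model, the read over a range that ends before the delete returns nothing when the flag is off and returns both values when it is on, which is the behaviour described above.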

Depending on what your typical queries are, you might run into performance issues. HBase sorts
all versions of a KeyValue adjacent to each other.
If you now want to query only the latest data (the last version), HBase will have to
skip a lot of other versions. In the worst case the latest versions of all KeyValues are on
separate (HFile) blocks.
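
The layout effect can be sketched with another toy model in plain Java. This is not HBase code: `VersionLayoutModel` is a hypothetical name, cells are reduced to (row, timestamp) pairs for a single column, and a "block" is assumed to hold a fixed number of cells, whereas real HFile blocks are sized in bytes. The model sorts cells by row ascending and timestamp descending, then counts how many blocks contain the newest version of some row.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model (NOT HBase code): how many blocks hold the latest version of each row.
final class VersionLayoutModel {
    static final class Cell {
        final String row;
        final long ts;

        Cell(String row, long ts) {
            this.row = row;
            this.ts = ts;
        }
    }

    // Builds cells for one column, sorted by row ascending, then timestamp descending.
    static List<Cell> sortedCells(int rows, int versionsPerRow) {
        List<Cell> cells = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            for (long v = 1; v <= versionsPerRow; v++) {
                cells.add(new Cell(String.format("row%04d", r), v));
            }
        }
        cells.sort((a, b) -> {
            int byRow = a.row.compareTo(b.row);
            return byRow != 0 ? byRow : Long.compare(b.ts, a.ts);
        });
        return cells;
    }

    // Counts the distinct blocks that contain the first (newest) cell of a row.
    static int blocksTouchedForLatest(List<Cell> cells, int cellsPerBlock) {
        Set<Integer> blocks = new HashSet<>();
        String previousRow = null;
        for (int i = 0; i < cells.size(); i++) {
            Cell c = cells.get(i);
            if (!c.row.equals(previousRow)) {
                blocks.add(i / cellsPerBlock);
                previousRow = c.row;
            }
        }
        return blocks.size();
    }

    public static void main(String[] args) {
        int cellsPerBlock = 50; // assumed fixed block capacity, for illustration only
        System.out.println(blocksTouchedForLatest(sortedCells(100, 1), cellsPerBlock));
        System.out.println(blocksTouchedForLatest(sortedCells(100, 50), cellsPerBlock));
    }
}
```

With one version per row, the 100 latest cells fit into two blocks in this model; with 50 versions per row, every latest cell sits in its own block, which is the worst case mentioned above.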

The question of whether to use the builtin timestamps or model the time as part of the row
keys (or even a time-column), is an interesting one.
Generally the row-key identifies your row. If you want a new row for each TS in your logical
model, you should manage the time dimension yourself.
Otherwise, if you have identities (i.e. rows) with many versions, the builtin TS might be better.
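
For the "manage the time dimension yourself" option, one possible row-key layout is sketched below in plain Java. Everything here is an assumption made for illustration and not something prescribed in this thread: the class `TimeInRowKey`, the entity ID followed by a zero-byte separator, and the reversed timestamp (`Long.MAX_VALUE - eventTime`), which is a commonly used trick to make the newest row for an entity sort first under byte-wise key ordering.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical row-key layout: <entity id bytes> 0x00 <8-byte reversed timestamp>.
final class TimeInRowKey {
    static byte[] rowKey(String entityId, long eventTimeMillis) {
        if (eventTimeMillis < 0) {
            throw new IllegalArgumentException("event time must be non-negative");
        }
        byte[] id = entityId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(id.length + 1 + Long.BYTES)
                .put(id)
                .put((byte) 0) // separator; assumes entity IDs never contain a zero byte
                .putLong(Long.MAX_VALUE - eventTimeMillis) // big-endian, newest sorts first
                .array();
    }

    // Recovers the event time from the last eight bytes of the key.
    static long eventTime(byte[] rowKey) {
        return Long.MAX_VALUE - ByteBuffer.wrap(rowKey).getLong(rowKey.length - Long.BYTES);
    }

    public static void main(String[] args) {
        byte[] older = rowKey("acct-1", 1_000L);
        byte[] newer = rowKey("acct-1", 2_000L);
        // Unsigned lexicographic comparison, the same kind of ordering used for row keys.
        System.out.println(Arrays.compareUnsigned(newer, older) < 0);
        System.out.println(eventTime(newer));
    }
}
```

With a layout like this, each point in time is its own row, so reading the latest state of an entity becomes a short scan from the entity prefix rather than a skip over older versions. The cost is that time-range reads and deletes are then the application's responsibility.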

-- Lars

From: Henning Blohm <henning.blohm@zfabrik.de>
To: user <user@hbase.apache.org> 
Sent: Saturday, August 10, 2013 6:26 AM
Subject: Using HBase timestamps as natural versioning


We are managing some naturally time-versioned data in HBase. That is, 
there are change events that have a specific time set, and when such an 
event is handled, data in HBase pertaining to the exact same point in 
time is updated.

So far we are using HBase time stamps to model the time dimension. All 
columns have unlimited number of versions. That worked ok so far, and 
HBase's way of providing access to data at a given time or time range 
seemed a natural fit.

We are aware of some tricky issues around timestamp handling (in 
particular in conjunction with deletes). As we need to migrate 
HBase-stored data (for other reasons) shortly, we are wondering whether our 
approach has some long-term drawbacks that we should pay attention to 
now, and whether we should re-design our timestamp handling as well.

So my questions are:

* Is there problematic experience with using HBase timestamps as time 
dimension of your data (assuming it has some natural time-based versioning)?

* Is it generally better to model time-based versioning of data within 
the data structure itself (e.g. in the row key) and why?

* In case you used HBase timestamps similar to the way we use them, 
feedback on how that worked is welcome as well!

