hbase-user mailing list archives

From Henning Blohm <henning.bl...@zfabrik.de>
Subject Re: Using HBase timestamps as natural versioning
Date Sat, 31 Aug 2013 10:09:25 GMT
The ID can be made fixed length, so the scan would work.
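
A minimal sketch of why the fixed length matters for the ID/0 start row, using only the Java standard library (Java 9+ for Arrays.compareUnsigned) rather than the HBase client. The names RowKeySketch, ID_WIDTH, rowKey and scan are hypothetical, and a TreeMap ordered by unsigned byte comparison stands in for HBase's sorted row keys. With a fixed-width ID prefix, every key of one ID sorts contiguously, so a range from ID/0 to ID/<t> cannot pick up rows of an ID that merely shares a prefix.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

class RowKeySketch {
    // Hypothetical fixed width of the ID part; shorter IDs are padded with zero bytes.
    static final int ID_WIDTH = 16;

    // Builds ID/time: the ID padded to ID_WIDTH bytes, followed by the
    // timestamp as 8 big-endian bytes (non-negative values sort in numeric order).
    static byte[] rowKey(String id, long timestampMs) {
        byte[] idBytes = id.getBytes(StandardCharsets.UTF_8);
        if (idBytes.length > ID_WIDTH) {
            throw new IllegalArgumentException("ID too long: " + id);
        }
        if (timestampMs < 0) {
            throw new IllegalArgumentException("negative timestamp");
        }
        ByteBuffer buf = ByteBuffer.allocate(ID_WIDTH + Long.BYTES); // zero-filled
        buf.put(idBytes);
        buf.position(ID_WIDTH);
        buf.putLong(timestampMs);
        return buf.array();
    }

    // Stand-in for a table: keys sorted as unsigned bytes, lexicographically.
    static TreeMap<byte[], String> newTable() {
        Comparator<byte[]> unsignedLexicographic = Arrays::compareUnsigned;
        return new TreeMap<>(unsignedLexicographic);
    }

    // Emulates a scan from ID/0 to ID/upToMs, both inclusive.
    static List<String> scan(TreeMap<byte[], String> table, String id, long upToMs) {
        return new ArrayList<>(
                table.subMap(rowKey(id, 0L), true, rowKey(id, upToMs), true).values());
    }
}
```

Zero padding is just one way to get a fixed width; hashing the ID to a fixed-size digest would serve the same purpose if the natural ID has no upper bound on its length.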

On 08/30/2013 04:58 PM, Ted Yu wrote:
> Is your ID fixed length or variable length ?
> If the length is fixed, you can specify ID/0 as the start row in scan.
> Cheers
> On Fri, Aug 30, 2013 at 5:42 AM, Henning Blohm <henning.blohm@zfabrik.de> wrote:
>> Was gone for a few days. Sorry for not getting back to this until now. And
>> thanks for adding to the discussion!
>> The time used in the timestamp is the "natural" time (in ms resolution) as
>> far as known. I.e. in the end it is of course some machine time, but the
>> trigger to choose it is some human interaction typically. So there is some
>> natural time to events that update a row's data.
>> If timestamps happen to differ just by 1 ms, as unlikely as that may be,
>> this would still be valid.
>> And the timestamp is always set by the client (i.e. the app server) when
>> performing an HBase put. So it's never the region server time or something
>> slightly arbitrary.
>> To recap: The data model (even before mapping to HBase) is essentially
>> ID -> ( attribute -> ( time -> value ))
>> (where ID is a composite key consisting of some natural elements and some
>> surrogate part).
>> An event is something like "at time t, attribute x of ID attained value
>> z".
>> Events may enter the system out of chronological order!
>> Typical access patterns are:
>> (R1) "Get me all attributes of ID at time t"
>> (R2) "Get me a trail of attribute changes between time t0 and t1"
>> (W1) "Set x=z on ID for time t"
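
The model and the three access patterns above can be sketched in memory with nested maps. This is only an illustration of the logical model ID -> (attribute -> (time -> value)), not of how HBase stores it; the class and method names (VersionedModelSketch, set, getAt, trail) are hypothetical, and values are plain strings for simplicity.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class VersionedModelSketch {
    // ID -> (attribute -> (time -> value)); the innermost map is sorted by time.
    private final Map<String, Map<String, TreeMap<Long, String>>> data = new HashMap<>();

    // W1: "Set x=z on ID for time t". Events may arrive out of order;
    // the TreeMap keeps the versions sorted by time regardless.
    void set(String id, String attribute, long time, String value) {
        data.computeIfAbsent(id, k -> new TreeMap<>())
            .computeIfAbsent(attribute, k -> new TreeMap<>())
            .put(time, value);
    }

    // R1: all attributes of ID at time t, i.e. for each attribute the
    // latest value written at or before t.
    Map<String, String> getAt(String id, long time) {
        Map<String, String> result = new TreeMap<>();
        Map<String, TreeMap<Long, String>> attributes =
                data.getOrDefault(id, Collections.emptyMap());
        for (Map.Entry<String, TreeMap<Long, String>> e : attributes.entrySet()) {
            Map.Entry<Long, String> version = e.getValue().floorEntry(time);
            if (version != null) {
                result.put(e.getKey(), version.getValue());
            }
        }
        return result;
    }

    // R2: trail of changes of one attribute with t0 <= time <= t1 (requires t0 <= t1).
    SortedMap<Long, String> trail(String id, String attribute, long t0, long t1) {
        TreeMap<Long, String> versions =
                data.getOrDefault(id, Collections.emptyMap()).get(attribute);
        if (versions == null) {
            return new TreeMap<>();
        }
        return new TreeMap<>(versions.subMap(t0, true, t1, true));
    }
}
```

R1 here corresponds to the "latest version at or before t" lookup per attribute, which is the part that either HBase's versioning or a scan over ID/time rows would have to provide.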
>> As said, we currently store data almost exactly the way I described the
>> model above (and that is probably why I wrote it down the way I did), using
>> the HBase timestamp to store the time dimension.
>> Alternative: Adding the time dimension to the row key
>> -----------
>> That would mean: ID/time -> (attribute -> value)
>> That would imply either keeping copies of all (later) attribute values in
>> all (later) rows, or putting only deltas and scanning over rows to collect
>> attribute values.
>> Let's assume the latter (for better storage and write performance).
>> Wouldn't that mean rebuilding what HBase already does? Is there nothing
>> HBase does more efficiently when performing R1, for example?
>> I.e.: Assume I want to get the latest state of row ID. In that case I would
>> need to scan from ID/0 to ID/<now> (or in reverse) to fish for all attribute
>> values (assuming I don't know all expected attributes beforehand). Is that
>> as efficient as an HBase get with max versions 1 and <now> as timestamp?
>> Thanks,
>> Henning
>> On 08/21/2013 01:11 PM, Michael Segel wrote:
>>> I would have to disagree with Lars on this one...
>>> It's really a bad design.
>>> To your point, your data is temporal in nature. That is to say, time is
>>> an element of your data and it should be part of your schema.
>>> You have to remember that time is relative.
>>> When a row is entered in to HBase, which time is used in the timestamp?
>>> The client(s)? The RS? Unless I am mistaken or the API has changed, you
>>> can set any arbitrary long value as the timestamp for a given
>>> row/cell.
>>> Like I said, it's relative.
>>> Since your data is temporal, what is the difference if the event happened
>>> at TS xxxxxxxx10 or xxxxxxxx11? (The point is that the TS differs by 1 in
>>> the least significant bit.)
>>> You could be trying to reference the same event.
>>> To Lars' point, if you make time part of your key, you could end up with
>>> hot spots. It depends on your key design. If it's the least significant
>>> portion of the key, it's less of an issue. (clientX | action | TS) would be
>>> an example that would sort the data by client, by action type, then by
>>> timestamp. (EPOCH - TS) would put the most current first.
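
A small sketch of the "most current first" idea. The thread does not say which constant EPOCH stands for; the code below assumes Long.MAX_VALUE, a common convention, and encodes the difference as 8 big-endian bytes so that byte-wise key order matches. The names ReverseTimestampSketch, reversed and decode are hypothetical, and Arrays.compareUnsigned in the usage needs Java 9+.

```java
import java.nio.ByteBuffer;

class ReverseTimestampSketch {
    // Encodes (Long.MAX_VALUE - ts) as 8 big-endian bytes, so that a larger
    // (more recent) timestamp yields a lexicographically smaller key suffix.
    static byte[] reversed(long timestampMs) {
        if (timestampMs < 0) {
            throw new IllegalArgumentException("negative timestamp");
        }
        return ByteBuffer.allocate(Long.BYTES)
                .putLong(Long.MAX_VALUE - timestampMs)
                .array();
    }

    // Recovers the original timestamp from the encoded suffix.
    static long decode(byte[] suffix) {
        return Long.MAX_VALUE - ByteBuffer.wrap(suffix).getLong();
    }
}
```

Comparing two suffixes with java.util.Arrays.compareUnsigned(reversed(2000L), reversed(1000L)) gives a negative result, meaning the later event sorts first; appended to (clientX | action), this would make a forward scan return the newest entries at the start of each prefix.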
>>> When you try to take a short cut, it usually will bite you in the ass.
>>> TANSTAAFL applies!
>>> HTH
>>> -Mike
>>> On Aug 11, 2013, at 12:21 AM, lars hofhansl <larsh@apache.org> wrote:
>>>> If you want deletes to work correctly you should enable
>>>> KEEP_DELETED_CELLS for your column families (I still think that should be
>>>> the default anyway).
>>>> Otherwise time-range queries will not be correct w.r.t. deleted data
>>>> (specifically, you cannot get back at deleted data even if you specify a
>>>> time range before the delete and even if your column family has unlimited
>>>> versions).
>>>> Depending on what your typical queries are, you might run into
>>>> performance issues. HBase sorts all versions of a KeyValue adjacent to each
>>>> other.
>>>> If you now want to query only the latest data (the last version),
>>>> HBase will have to skip a lot of other versions. In the worst case the
>>>> latest versions of all KeyValues are on separate (HFile) blocks.
>>>> The question of whether to use the builtin timestamps or model the time
>>>> as part of the row keys (or even a time-column), is an interesting one.
>>>> Generally the row-key identifies your row. If you want a new row for
>>>> each TS in your logical model you should manage the time dimension yourself.
>>>> Otherwise, if you have identities (i.e. rows) with many versions, the
>>>> builtin TS might be better.
>>>> -- Lars
>>>> ________________________________
>>>> From: Henning Blohm <henning.blohm@zfabrik.de>
>>>> To: user <user@hbase.apache.org>
>>>> Sent: Saturday, August 10, 2013 6:26 AM
>>>> Subject: Using HBase timestamps as natural versioning
>>>> Hi,
>>>> we are managing some naturally time-versioned data in HBase. That is,
>>>> there are change events that have a specific time set, and when such an
>>>> event is handled, data in HBase pertaining to the exact same point in
>>>> time is updated.
>>>> So far we are using HBase timestamps to model the time dimension. All
>>>> columns have an unlimited number of versions. That has worked OK so far,
>>>> and HBase's way of providing access to data at a given time or time range
>>>> seemed a natural fit.
>>>> We are aware of some tricky issues around timestamp handling (e.g. in
>>>> particular in conjunction with deletes). As we need to migrate data
>>>> stored in HBase shortly (for other reasons), we are wondering whether our
>>>> approach has some long-term drawbacks that we should pay attention to
>>>> now, and whether we should re-design our timestamp handling as well.
>>>> So my question is:
>>>> * Is there problematic experience with using HBase timestamps as time
>>>> dimension of your data (assuming it has some natural time-based
>>>> versioning)?
>>>> * Is it generally better to model time-based versioning of data within
>>>> the data structure itself (e.g. in the row key) and why?
>>>> * In case you used HBase timestamps similar to the way we use them,
>>>> feedback on how that worked is welcome as well!
>>>> Thanks,
>>>> Henning
>>> The opinions expressed here are mine, while they may reflect a
>>> cognitive thought, that is purely accidental.
>>> Use at your own risk.
>>> Michael Segel
>>> michael_segel (AT) hotmail.com
>> --
>> Henning Blohm
>> *ZFabrik Software KG*
>> T:      +49 6227 3984255
>> F:      +49 6227 3984254
>> M:      +49 1781891820
>> Lammstrasse 2 69190 Walldorf
>> henning.blohm@zfabrik.de <mailto:henning.blohm@zfabrik.de>
>> Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
>> ZFabrik <http://www.zfabrik.de>
>> Blog <http://www.z2-environment.net/blog>
>> Z2-Environment <http://www.z2-environment.eu>
>> Z2 Wiki <http://redmine.z2-environment.net>

Henning Blohm

*ZFabrik Software KG*

T: 	+49 6227 3984255
F: 	+49 6227 3984254
M: 	+49 1781891820

Lammstrasse 2 69190 Walldorf

henning.blohm@zfabrik.de <mailto:henning.blohm@zfabrik.de>
Linkedin <http://www.linkedin.com/pub/henning-blohm/0/7b5/628>
ZFabrik <http://www.zfabrik.de>
Blog <http://www.z2-environment.net/blog>
Z2-Environment <http://www.z2-environment.eu>
Z2 Wiki <http://redmine.z2-environment.net>
