hbase-user mailing list archives

From Jonathan Gray <jg...@facebook.com>
Subject RE: How is column timestamp useful?
Date Fri, 07 May 2010 16:54:55 GMT
I would argue that the primary reasons for versioning have nothing to do with "rescuing users"
or being able to recover data.

To reiterate what others have said, HBase/BigTable is versioned because
of the immutable nature of data (an update is a newer version on top of the old version, not
actually an update) and the original web crawling use case where they wanted to keep historical

As you say, it is certainly possible to model most timestamp-based schemas without using the
built-in versioning (by adding it to the row key or to the column qualifiers).
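
To make the row-key alternative concrete, here is a minimal Python sketch. The function name `crawl_row_key`, the NUL separator, and the reversed-timestamp encoding are all assumptions chosen for illustration; HBase does not prescribe any of them. The only property relied on is that row keys sort as raw bytes, so subtracting the timestamp from the largest signed 64-bit value makes the newest crawl of a site sort first.

```python
import struct

# Largest signed 64-bit value; subtracting the timestamp from it makes
# newer timestamps produce smaller (earlier-sorting) key suffixes.
LONG_MAX = 2**63 - 1


def crawl_row_key(site: str, crawl_ts_ms: int) -> bytes:
    """Build a hypothetical row key: site, a NUL separator, reversed timestamp."""
    if not 0 <= crawl_ts_ms <= LONG_MAX:
        raise ValueError("timestamp must fit in a non-negative signed 64-bit value")
    reversed_ts = LONG_MAX - crawl_ts_ms
    # Big-endian encoding keeps byte-wise order equal to numeric order
    # for non-negative values.
    return site.encode("utf-8") + b"\x00" + struct.pack(">q", reversed_ts)
```

With keys like this, each crawl is its own row, so nothing in the table itself limits how many crawls accumulate per site.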

But to revisit the crawl example again, imagine our requirement is that we want to keep the
last 10 crawls of every site.  If I were storing each crawl in a row that included the timestamp
of the crawl, I would need my own background process to garbage collect any crawl that was
not one of the 10 most recent.  By utilizing integrated version limits, I can set maxVersions
to be 10, and as a background process HBase will automatically garbage collect away old crawls
beyond the threshold I set.
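
The effect of a version limit can be sketched with a toy model in Python. This is not HBase code and does not use the HBase API: the class `VersionedCell` and its `compact` method are hypothetical names, and the model only assumes that versions are keyed by timestamp and that cleanup keeps the newest `max_versions` of them.

```python
class VersionedCell:
    """Toy model of a single cell that keeps a bounded number of versions."""

    def __init__(self, max_versions: int):
        if max_versions < 1:
            raise ValueError("max_versions must be at least 1")
        self.max_versions = max_versions
        self.versions = {}  # timestamp -> value

    def put(self, timestamp: int, value: str) -> None:
        # A write never modifies an older version; it adds a new one
        # (or replaces the value stored at the same timestamp).
        self.versions[timestamp] = value

    def compact(self) -> None:
        # Stand-in for the background cleanup: keep only the newest
        # max_versions timestamps and drop everything older.
        newest = sorted(self.versions, reverse=True)[: self.max_versions]
        self.versions = {ts: self.versions[ts] for ts in newest}


cell = VersionedCell(max_versions=10)
for ts in range(1, 16):          # 15 crawls, timestamps 1..15
    cell.put(ts, f"crawl-{ts}")
cell.compact()                   # crawls 1..5 are dropped, 6..15 remain
```

The point of the sketch is that the writer only ever calls `put`; deciding which crawls are too old is left entirely to the store.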

As far as pushing timestamps into rows in order to avoid large rows, this is a fair point,
but remember it is the goal of HBase to support rows with millions of columns and versions
(if you are considering billions of versions in one row, then perhaps this is no longer a
sane use of the integrated versioning).  While this row cannot be split across two regionservers,
oftentimes this is okay or even desirable.  For example, if my row is a userid, I may want
a given user to only live on a single machine rather than be spread across multiple machines.
Among other reasons, this provides better overall availability for users as a single machine
failure only impacts the users who live on that machine (if each user were spread across machines,
availability of each machine impacts a much larger percentage of users).
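
The availability argument can be put into rough numbers. The Python sketch below rests on simplifying assumptions that are mine, not measurements: every user's data sits on the same number of distinct machines, those machines are chosen uniformly at random, and a user counts as affected if any machine holding part of their data is down. The function name is hypothetical.

```python
def fraction_of_users_affected(num_machines: int, machines_per_user: int) -> float:
    """Expected fraction of users hit when exactly one machine fails.

    Under uniform placement, a given machine is among the machines_per_user
    machines holding a user's data with probability
    machines_per_user / num_machines.
    """
    if not 1 <= machines_per_user <= num_machines:
        raise ValueError("need 1 <= machines_per_user <= num_machines")
    return machines_per_user / num_machines


single = fraction_of_users_affected(100, 1)    # each user on one machine
spread = fraction_of_users_affected(100, 10)   # each user on ten machines
```

Under these assumptions, on 100 machines a single failure touches 1% of users when each user lives on one machine, and 10% when each user is spread across ten.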

Hope that helps.


> -----Original Message-----
> From: Takayuki Tsunakawa [mailto:tsunakawa.takay@jp.fujitsu.com]
> Sent: Friday, May 07, 2010 12:04 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: How is column timestamp useful?
> All,
> Thank you for giving lots of opinions and information. I'll try to
> persuade my colleagues as follows:
> I couldn't find any good examples where versioning should be
> definitely utilized. However, HBase community members gave me the idea
> on how versioning is useful.
> 1. Recover data lost by accidental deletions or updates
>    (I think this is the most persuasive reason)
> 2. Auditing (change tracking) for compliance
>    However, this is not persuasive, because advanced RDBMSs provide
> audit trails, not versioning. Versioning itself does not show who
> changed the data or how.
> 3. Recording events (as in Google's personalized search)
>    This is not persuasive either. As I wrote in the previous mail,
> embedding the time of the event in the row key may be better because it
> prevents the rows from becoming big.
> If versioning is not necessary from your requirement, you can ignore
> timestamps (do not have to specify timestamp in API call).
> Although HBase keeps three versions by default and it may be a bit
> wasteful for memory and disk, turning on compression for column
> families can reduce the waste to a negligible level (is that
> true?).
> If saving memory (=keep memtable as small as possible) is important,
> you can set the maximum number of versions to 1.
> The reason that the default is 3 is to rescue users from their
> mistakes.
> (If users accidentally delete or update data, you have to develop a
> tool that pulls previous data records.)
> Regards
> Takayuki
