hbase-user mailing list archives

From "Takayuki Tsunakawa" <tsunakawa.ta...@jp.fujitsu.com>
Subject Re: How is column timestamp useful?
Date Fri, 07 May 2010 04:56:18 GMT
Hello, ryan-san

Thank you for your kind and polite reply.

I read the Bigtable paper and found two use cases of the column
timestamp at Google.

[use case 1]
In our Webtable example, we can set the timestamps of the crawled
pages stored in the contents: column to the times at which these page
versions were actually crawled. The garbage-collection mechanism
described above enables us to tell Bigtable to keep only the most
recent three versions of every page.

[use case 2]
8.3 Personalized Search
Personalized Search (www.google.com/psearch) is an opt-in service that
records user queries and clicks across a variety of Google properties
such as web search, images, and news. Users can browse their search
histories to revisit their old queries and clicks, and they can ask
for personalized search results based on their historical Google usage
patterns.
Personalized Search stores each user's data in Bigtable. Each user has
a unique userid and is assigned a row named by that userid. All user
actions are stored in a table. A separate column family is reserved
for each type of action (e.g., there is a column family that stores
all web queries). Each data element uses as its Bigtable timestamp the
time at which the corresponding user action occurred. Personalized
Search generates user profiles using a MapReduce over Bigtable. These
user profiles are used to personalize live search results.

In use case 1, I don't understand why three versions of each web page
need to be saved, so this example does not help me.
Use case 2 is interesting. It shows that the column timestamp can be
used to accumulate the events associated with each subject.
However, as you pointed out, this structure can lead to big rows, so
this usage pattern is applicable only when the number of accumulated
events can be bounded. When storing machine logs (e.g. CPU load, disk
usage, network bandwidth usage), the time of the event should perhaps
be part of the row key (i.e. one row per event). In this sense, the
timestamp feature is not a necessity for Personalized Search.
For example, the data may be structured as follows:

row key "<userid>-<time_of_event>"
column family "action:"
  column "action_type" (e.g. click, web search)
  column "action_data" (e.g. clicked URL, web search query)

This structure eliminates the concern about big rows.
# I wonder if there is any difference in the simplicity of application
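
A minimal sketch of the one-row-per-event layout above, written in
Python against a toy in-memory table rather than the HBase client API.
The names make_row_key, ToyTable and scan_prefix are hypothetical, and
the 13-digit zero-padded millisecond timestamp is an assumption made so
that the sorted order of row keys matches the order of events.

```python
from bisect import bisect_left, insort


def make_row_key(userid: str, event_millis: int) -> str:
    # Zero-pad the time so that string order equals chronological order.
    # Assumes userid itself contains no "-" (illustrative convention only).
    return f"{userid}-{event_millis:013d}"


class ToyTable:
    """In-memory stand-in for a table kept sorted by row key (not HBase)."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> {column name: value}

    def put(self, row_key, columns):
        if row_key not in self._rows:
            insort(self._keys, row_key)
        self._rows[row_key] = dict(columns)

    def scan_prefix(self, prefix):
        # Jump to the first key >= prefix, then read until the prefix ends.
        start = bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key, self._rows[key]
```

With this layout each event is a small row, and all events of one user
can still be read in time order by scanning the prefix "<userid>-".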

> The versioning of HBase is integral to the storage mechanism behind
> (and also cassandra and all bigtable like systems).

Do you mean that versioning was invented mainly for the
implementation of Bigtable/HBase and not for the users' sake? If the
maximum number of versions is set to one when creating tables, are
there any bad effects due to the Bigtable/HBase implementation (e.g.
on performance)? If there is no bad impact, I feel it would be better
for the default to be one rather than three, and those who want to use
versioning should specify the maximum versions when creating tables.
That reduces the memtable size and disk storage space by storing only
one version.
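
To make the question concrete, here is a toy Python model of one cell
with a maximum-versions setting. It is only a sketch under my own
assumptions and not how HBase is implemented: the class VersionedCell
and its methods are hypothetical, and it assumes that extra versions
stay in storage until a compaction step drops them, as described in the
quoted reply.

```python
class VersionedCell:
    """Toy model of a single cell that keeps timestamped versions."""

    def __init__(self, max_versions=3):
        if max_versions < 1:
            raise ValueError("max_versions must be at least 1")
        self.max_versions = max_versions
        self._versions = {}  # timestamp -> value

    def put(self, timestamp, value):
        # Writes never modify older versions; they only add a new one.
        # A second write with the same timestamp replaces that version.
        self._versions[timestamp] = value

    def get(self, n=1):
        # Return up to n newest versions, never more than max_versions.
        newest_first = sorted(self._versions.items(), reverse=True)
        return newest_first[:min(n, self.max_versions)]

    def compact(self):
        # Drop everything except the max_versions newest timestamps.
        keep = sorted(self._versions, reverse=True)[:self.max_versions]
        self._versions = {ts: self._versions[ts] for ts in keep}

    def stored_count(self):
        return len(self._versions)
```

In this model a setting of one changes only what reads return and what
survives compact(); every write still adds a version first.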

Any opinions and information are appreciated.


----- Original Message ----- 
From: "Ryan Rawson" <ryanobjc@gmail.com>
To: <hbase-user@hadoop.apache.org>
Sent: Friday, May 07, 2010 11:42 AM
Subject: Re: How is column timestamp useful?

> Have a look at the bigtable paper, it should help you understand
> somewhat why things are the way they are.
> The versioning of HBase is integral to the storage mechanism behind
> (and also cassandra and all bigtable like systems).  HBase stores
> data on HDFS which has immutable files. Thus "overwriting" old data
> just does not exist.  So a versioning mechanism was introduced (all
> part of the original BT paper) to allow you to supersede and delete
> (via adding special delete markers) old values. A process known as
> 'compaction' removes excessive versions and deleted values - this
> compaction is run by default once a day (it is IO intensive).
> If you don't care about timestamp, you can just ignore them and use
> HBase like any storage system - with a small caveat: excessive version
> creation can cause issues (think hundreds of megs of versions in one
> row - a region would end up being 1 row and larger than the max size
> for a region and thus un-splittable).  So avoid that.
> But other relational databases use versioning, for example the MVCC
> in Postgres causes multiple versions of a value. Normally this is
> completely hidden and is used primarily to implement TX isolation, but
> it is also operationally exposed to the administrator - the vacuum
> command.
> Looking at the wiki entry for Temporal database, I can say that HBase
> (and bigtable) are NOT temporal databases by their example. When you
> delete a row, it is removed and thus the data goes away.  There is a
> time component, but I encourage people to think of it as versioning
> and backup against application bugs - excessive use of the time
> dimension can cause problems (by making a single row larger than the
> max size of a region).
> -ryan
> 2010/5/6 Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com>:
>> Hello,
>> I'm new to HBase, so excuse me if I ask odd questions.
>> I'm evaluating HBase from its documentation, and am attracted by
>> broad functionality such as transaction support, secondary index,
>> API, MapReduce integration, etc. When I recommended HBase to my
>> colleagues for the internal project, I was asked a question about how
>> column timestamp (version) is useful. They said "One of the good
>> things of key-value stores is the simple and flexible data model.
>> But HBase has more structural elements than RDB, column family and
>> timestamp, and those additional elements make HBase a bit more difficult
>> than RDB. I understand the usefulness of column family, however, in
>> what situations is timestamp used? Is it really necessary?"
>> I couldn't answer their question. Then I searched HBase web site,
>> HBase user mailing list archive, other web sites with keyword
>> timestamp", and Cassandra's web site for help. But I could not find
>> any information about how the column timestamp (versioning) is used.
>> Could you tell me in what situations the timestamp is absolutely
>> necessary or at least desirable? Some real world examples are much
>> appreciated.
>> From the search results, many people don't seem to use the
>> feature. However, the default maximum versions for each column is three.
>> If versioning is rarely utilized, doesn't it mean that the storage
>> space for two extra versions is wasted and the default should be one?
>> Please give me your opinions.
>> Is HBase's timestamp feature intended for the following "temporal
>> database"? If so, how do you structure the Person table in the
>> following page?
>> http://en.wikipedia.org/wiki/Temporal_database
>> Regards
