hbase-dev mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-47) option to set TTL for columns in hbase
Date Mon, 28 Apr 2008 18:57:55 GMT

    [ https://issues.apache.org/jira/browse/HBASE-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592873#action_12592873 ]

Andrew Purtell commented on HBASE-47:

Please see attached patch "hbase-ttl-0.1.patch". 

These items were easy:

- Add a TTL getter and setter to HColumnDescriptor
- Change the shell's CREATE TABLE statement so it takes a TTL parameter for column families
(and likewise ALTER TABLE, etc.)
- Make the compactor screen out cells past their TTL

This item was more involved due to memcache:

- Update HStore methods (get, put, etc.) to check HStoreKey timestamps against the TTL value
on every operation
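
The timestamp check described above can be sketched roughly as follows. This is only an illustration, not code from the patch: the class name TtlCheck, the FOREVER sentinel, and the use of milliseconds are all assumptions made for the example.

```java
// Hypothetical sketch: names and units are assumptions, not taken from the patch.
final class TtlCheck {
    /** Sentinel meaning "never expire" (an assumption for this sketch). */
    static final long FOREVER = -1L;

    /**
     * Returns true when a cell written at cellTimestampMs is older than ttlMs
     * as of nowMs. All three values are in milliseconds.
     */
    static boolean isExpired(long cellTimestampMs, long ttlMs, long nowMs) {
        if (ttlMs == FOREVER) {
            return false; // no TTL configured, so the cell never expires
        }
        return nowMs - cellTimestampMs > ttlMs;
    }
}
```

A store method would presumably skip any key for which such a check returns true, passing the current time as nowMs.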

I have done some simple testing and it works for me. I have not yet written a more involved
test case because I had to make two changes to Memcache that might not be so nice. They may
be fine as they are, or they may need more work or some rethinking.

First, in order to enforce TTLs, Memcache has to be aware of them. I noticed that a Memcache
is effectively associated with an HStore, so I made the association explicit by changing the
Memcache constructor to accept an HStore as its sole parameter. This way Memcache can pick
up TTLs from the HColumnDescriptor of its HStore.
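
The shape of that association might look like the sketch below. The classes FamilySettings, Store, and WriteBuffer are hypothetical stand-ins for HColumnDescriptor, HStore, and Memcache; their methods are invented for illustration and are not the real HBase signatures.

```java
// Hypothetical stand-ins; none of these are the real HBase classes or signatures.
final class FamilySettings {            // plays the role of HColumnDescriptor
    private final long ttlMs;
    FamilySettings(long ttlMs) { this.ttlMs = ttlMs; }
    long getTimeToLive() { return ttlMs; }
}

final class Store {                     // plays the role of HStore
    private final FamilySettings family;
    Store(FamilySettings family) { this.family = family; }
    FamilySettings getFamily() { return family; }
}

final class WriteBuffer {               // plays the role of Memcache
    private final Store store;

    // The owning store is the sole constructor parameter, making the link explicit.
    WriteBuffer(Store store) { this.store = store; }

    // The TTL is read through the store rather than copied, so the buffer
    // always sees the column family's configured value.
    long ttlMs() { return store.getFamily().getTimeToLive(); }
}
```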

Second, I modified Memcache to discard cells with expired TTLs from the map. Otherwise, with
short TTLs and small values, hundreds or thousands of expired cells may accumulate and have
to be iterated through repeatedly and unnecessarily on every getFull, get, etc. However,
because of how the code is structured, expired cells may currently be dropped from both the
memcache map and the snapshot map. Removing entries from the snapshot map possibly violates
some assumptions made elsewhere. Avoiding this would, in my estimation, require some refactoring.
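
Discarding expired cells from a sorted in-memory map could be sketched as below. This is a simplified illustration under assumptions: the class ExpiringCellMap is hypothetical, the map is keyed by a bare timestamp in milliseconds rather than an HStoreKey, and the snapshot map is left out entirely.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch; the real Memcache keys its map by HStoreKey, not by a bare timestamp.
final class ExpiringCellMap {
    private final SortedMap<Long, byte[]> cells = new TreeMap<>();
    private final long ttlMs;

    ExpiringCellMap(long ttlMs) { this.ttlMs = ttlMs; }

    void put(long timestampMs, byte[] value) { cells.put(timestampMs, value); }

    int size() { return cells.size(); }

    /** Removes every cell older than the TTL as of nowMs and returns how many were dropped. */
    int purgeExpired(long nowMs) {
        int removed = 0;
        Iterator<Map.Entry<Long, byte[]>> it = cells.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMs - it.next().getKey() > ttlMs) {
                it.remove(); // removing via the iterator is safe during iteration
                removed++;
            }
        }
        return removed;
    }
}
```

Purging through the iterator keeps later reads from walking dead entries. The same removal applied to a snapshot map is the part that may conflict with assumptions elsewhere.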

> option to set TTL for columns in hbase
> --------------------------------------
>                 Key: HBASE-47
>                 URL: https://issues.apache.org/jira/browse/HBASE-47
>             Project: Hadoop HBase
>          Issue Type: New Feature
>          Components: hql, regionserver
>            Reporter: Billy Pearson
>            Priority: Minor
> I would like to see the option to have a TTL on the columns in hbase. This feature could
be helpful in removing stale data from large datasets without having to do a full scan
of the dataset and then issuing deletes.
> Example 
> Say I am crawling pages and only refreshing pages based on a set score. If some pages
do not get updated over X days, the old version of the page gets removed from the data set.

> Say I am stripping out links from html and storing them. If a link is removed from a page,
then I would need to issue a delete statement to remove that link from the data set. With
a ttl, the link data would remove itself if not updated in x secs. These are just examples
based on crawling like nutch, but I can foresee many apps using this option.
> This is a feature in bigtable that is handled when bigtable does garbage-collection.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
