hbase-user mailing list archives

From Ophir Cohen <oph...@gmail.com>
Subject Data retention in HBase
Date Mon, 09 May 2011 09:59:25 GMT
Hi All,
At my company we are currently working hard on deploying our cluster with
HBase.

We are talking about ~20 nodes holding pretty big data (~1TB per day).

As there is a lot of data, we need a retention method, i.e. a way to remove
old data.

The problem is that I can't/don't want to do it using TTL, for two reasons:

   1. Different customers have different retention policies.
   2. A policy might change.
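To make the requirement concrete, here is a minimal sketch of the per-customer expiry decision itself. The `RETENTION_DAYS` map, the customer names, and the `isExpired` helper are all hypothetical; in practice the policies would live in a config table or file so they can change without redeploying:

```java
import java.util.HashMap;
import java.util.Map;

public class RetentionCheck {
    // Hypothetical per-customer retention policies, in days.
    static final Map<String, Integer> RETENTION_DAYS = new HashMap<>();
    static {
        RETENTION_DAYS.put("customerA", 30);
        RETENTION_DAYS.put("customerB", 90);
    }
    static final long MS_PER_DAY = 24L * 60 * 60 * 1000;

    /** Returns true if a row written at rowTs (ms) should be removed as of 'now' (ms). */
    static boolean isExpired(String customer, long rowTs, long now) {
        Integer days = RETENTION_DAYS.get(customer);
        if (days == null) return false; // no policy for this customer -> keep the row
        return now - rowTs > days * MS_PER_DAY;
    }

    public static void main(String[] args) {
        long now = 100 * MS_PER_DAY;
        // Same row age (40 days), different outcome per customer policy:
        System.out.println(isExpired("customerA", 60 * MS_PER_DAY, now)); // 30-day policy -> true
        System.out.println(isExpired("customerB", 60 * MS_PER_DAY, now)); // 90-day policy -> false
    }
}
```

This is exactly what a single TTL per column family can't express: the cutoff depends on the customer, and the map can change at any time.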

Of course, I could do it with a nightly (weekly?) MR job that runs over all
the data and removes the old rows.
There are a few problems with that:

   1. Running over a huge amount of data only to remove a small portion of it.
   2. It would be a heavy MR job.
   3. A major compaction would be needed afterwards - and that will affect
   performance or even stop service (is that right???).

I might use BulkFileOutputFormat for that job - but the problems above remain.

As my data is sorted by the retention policies (customer and time), I thought
of this option:

   1. Split regions so that one region holds only the 'candidates for removal'.
   2. Drop this region.

   - Is it possible to drop a region?
   - Do you think it is a good idea?
   - Any other ideas?
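For the split idea to work, the candidate rows have to be contiguous in row-key order. A sketch of a row-key layout that guarantees this - the `<customer>#<zero-padded epoch day>#<suffix>` layout is my assumption, not anything HBase mandates - so that all expired rows of one customer form a single key range you could scan-and-delete, or split at:

```java
public class ExpiredRange {
    // Hypothetical row-key layout: "<customer>#<epoch day, 8 digits>#<suffix>".
    // Zero-padding keeps the lexicographic order of keys equal to the
    // chronological order of days, so "older than cutoff" is one key range.

    /** Inclusive lower bound: the first possible key for this customer. */
    static String startKey(String customer) {
        return customer + "#";
    }

    /** Exclusive upper bound: the first key at the cutoff day. */
    static String stopKey(String customer, long cutoffEpochDay) {
        return customer + "#" + String.format("%08d", cutoffEpochDay);
    }

    public static void main(String[] args) {
        // Everything in [startKey, stopKey) is older than the cutoff day
        // and could be scanned for deletion, or used as a split point.
        System.out.println(startKey("customerA"));       // customerA#
        System.out.println(stopKey("customerA", 15104)); // customerA#00015104
    }
}
```

With such keys, the nightly job would only ever touch the expired range per customer instead of the whole table, whatever the answer on dropping regions turns out to be.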


Ophir Cohen
