hbase-user mailing list archives

From Ophir Cohen <oph...@gmail.com>
Subject Re: Data retention in HBase
Date Mon, 09 May 2011 17:09:51 GMT
Thanks for the answer!

A little bit more info:
Our data is internal events grouped into sessions (i.e. groups of events).
Different sessions belong to different customers.
We are talking about millions of sessions per day.

The key is *customer id - timestamp - session id*.
So yes, the data is sorted by customer and date, and since I want to remove
rows by customer and date, it is sorted exactly the way I need.
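
For concreteness, here is a minimal sketch of how such a composite key can be
built (class and method names are mine, and I am assuming fixed-width
big-endian longs so the byte-wise sort matches the customer/time order):

    import org.apache.hadoop.hbase.util.Bytes;

    public class SessionKeys {
      // Fixed-width fields keep the lexicographic sort order: all rows of
      // a customer are contiguous and ordered by timestamp within them.
      // Bytes.toBytes(long) is big-endian, so non-negative longs sort
      // correctly as raw bytes.
      public static byte[] rowKey(long customerId, long timestamp, long sessionId) {
        return Bytes.add(Bytes.toBytes(customerId),
                         Bytes.toBytes(timestamp),
                         Bytes.toBytes(sessionId));
      }
    }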
Actually, the main motivation for removing old rows is that we have storage
limitations (and too much data...).

So my question is whether we can do something better than a nightly/weekly
MapReduce job that ends with a major compaction.
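
To make the baseline concrete, this is roughly the job I'd like to avoid,
sketched here as plain client code (the table name and the 1000-row batch
size are made up; the real thing would be a MapReduce job over the whole
table):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class Purge {
      // Delete all rows of one customer strictly older than cutoffTs.
      // The stop row is exclusive, and real keys carry a session-id
      // suffix, so rows at exactly cutoffTs are kept.
      static void purgeCustomer(long customerId, long cutoffTs) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "sessions");  // hypothetical table name
        Scan scan = new Scan(
            Bytes.add(Bytes.toBytes(customerId), Bytes.toBytes(0L)),
            Bytes.add(Bytes.toBytes(customerId), Bytes.toBytes(cutoffTs)));
        ResultScanner scanner = table.getScanner(scan);
        List<Delete> batch = new ArrayList<Delete>();
        for (Result r : scanner) {
          batch.add(new Delete(r.getRow()));
          if (batch.size() >= 1000) {
            table.delete(batch);
            batch.clear();
          }
        }
        if (!batch.isEmpty()) table.delete(batch);
        scanner.close();
        table.close();
      }
    }

And the Deletes only write tombstones, so the disk space comes back only
after a major compaction - which is exactly the part that worries me.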
Ophir
PS
The majority of my customers share the same retention policy, but I still need
the ability to change it for a specific customer.
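
Something like the following is all I mean by a per-customer policy: a lookup
with a shared default and a few overrides (all names and numbers here are
hypothetical):

    import java.util.HashMap;
    import java.util.Map;

    public class RetentionPolicy {
      static final int DEFAULT_RETENTION_DAYS = 30;  // made-up default
      static final Map<Long, Integer> OVERRIDES = new HashMap<Long, Integer>();
      static {
        OVERRIDES.put(42L, 90);  // e.g. customer 42 keeps 90 days of sessions
      }

      // Timestamp below which a customer's rows are candidates for removal.
      static long cutoffFor(long customerId) {
        Integer days = OVERRIDES.get(customerId);
        if (days == null) days = DEFAULT_RETENTION_DAYS;
        return System.currentTimeMillis() - days * 24L * 60L * 60L * 1000L;
      }
    }

Changing an override only changes which rows the next purge run touches.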


On Mon, May 9, 2011 at 6:48 PM, Ted Dunning <tdunning@maprtech.com> wrote:

> Can you say a bit more about your data organization?
>
> Are you storing transactions of some kind?  If so, can your key involve
> time?
>  I think that putting some extract of time (day number perhaps) as a
> leading part of the key could make dropping old data much easier.
>
> Are you storing profiles where the key is the user (or something) id and
> the
> data is essentially a list of transactions?  If so, can you segregate
> transactions into separate column families that can be dropped as data
> expires?
>
> When you say data expiration varies by customer, is that really necessary
> or
> can you have a lowest common denominator for actual deletions with rules
> that govern how much data is actually visible to the consumer of the data?
>
> On Mon, May 9, 2011 at 2:59 AM, Ophir Cohen <ophchu@gmail.com> wrote:
>
> > Hi All,
> > In my company we are currently working hard on deploying our cluster
> > with HBase.
> >
> > We are talking about ~20 nodes holding pretty big data (~1TB per day).
> >
> > As there is a lot of data, we need a retention method, i.e. a way to
> > remove old data.
> >
> > The problem is that I can't/don't want to do it using TTL, for two reasons:
> >
> >   1. Different retention policy for different customers.
> >   2. The policy might change.
> >
> >
> > Of course, I can do it with a nightly (weekly?) MR job that runs over all
> > the data and removes the old rows.
> > There are a few problems:
> >
> >   1. Running over a huge amount of data only to remove a small portion of it.
> >   2. It'll be a heavy MR job.
> >   3. Need to perform a major compaction afterwards - that will affect
> >   performance or even stop service (is that right???).
> >
> > I might use BulkFileOutputFormat for that job - but I would still have
> > those problems.
> >
> > As my data is sorted by the retention policy dimensions (customer and
> > time), I thought of this option:
> >
> >   1. Split regions so that one region holds only the 'candidates for
> >   removal'.
> >   2. Drop this region.
> >
> >
> >   - Is it possible to drop a region?
> >   - Do you think it's a good idea?
> >   - Any other ideas?
> >
> > Thanks,
> >
> > Ophir Cohen
> > LivePerson
> >
>
