hbase-user mailing list archives

From Ophir Cohen <oph...@gmail.com>
Subject Re: Data retention in HBase
Date Tue, 10 May 2011 15:50:21 GMT
OK, so to summarize the discussion (and raise a few more questions), here is
what I gathered:

I have two options:

1. I can run a map/reduce job over the rows I want to delete (a rough sketch
follows the question below).
Main problem here: after each job I need to run a major compaction, which will
stop service while the compaction runs.

Question:

   - Why should a major compaction stop service? (BTW, my main concern is
   insertions; I can live with a denial of service on reads.)
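
To make option 1 concrete, here is a minimal, untested sketch of such a delete
job plus the follow-up major compaction. The table name "sessions", the scan
boundaries and the cut-off value in them are placeholders I made up for
illustration, not our real schema; the job just emits a Delete per scanned row
and then asks for a major compaction so the tombstoned data is actually
reclaimed:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class RetentionDeleteJob {

  /** Emits a Delete for every row the scan hands it. */
  static class DeleteMapper extends TableMapper<ImmutableBytesWritable, Delete> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      context.write(row, new Delete(row.get()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "retention-delete");
    job.setJarByClass(RetentionDeleteJob.class);

    // Restrict the scan to the expired key range, e.g. one customer's rows
    // older than its retention cut-off (rows and cut-off are placeholders).
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("customer42-"));
    scan.setStopRow(Bytes.toBytes("customer42-1301616000000"));
    scan.setCaching(500);
    scan.setCacheBlocks(false);

    // Map-only job: the mapper's Deletes are written straight back to the table.
    TableMapReduceUtil.initTableMapperJob("sessions", scan, DeleteMapper.class,
        ImmutableBytesWritable.class, Delete.class, job);
    TableMapReduceUtil.initTableReducerJob("sessions", null, job);
    job.setNumReduceTasks(0);

    if (!job.waitForCompletion(true)) {
      System.exit(1);
    }

    // The deletes above only write tombstones; the disk space comes back after
    // a major compaction, requested here per table (it can also be per region).
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.majorCompact("sessions");
  }
}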

2. Split off a specific region and delete that region (sketched after the
questions below).

Questions here:

   - How is the META table updated after I close the region and remove the
   files? Should I remove the region from the META table as well?
   - Why do I need to disable the table? For how long, do you think, would I
   need to disable it? Can I avoid it?
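
To have something concrete to test for option 2, here is an untested sketch of
the split-and-drop flow. The table name, the expiry check, and the assumption
that removing the region's files plus deleting its row from .META. is all the
bookkeeping needed are mine, and that is exactly what I want to verify:

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.HServerAddress;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DropExpiredRegion {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTable table = new HTable(conf, "sessions");

    // 1. Split so that the expired key range ends up isolated in its own
    //    region(s). split() only takes a table/region name here; steering the
    //    exact split point depends on the HBase version.
    admin.split("sessions");

    // 2. Find a region whose whole key range is past the retention cut-off.
    Map<HRegionInfo, HServerAddress> regions = table.getRegionsInfo();
    HRegionInfo expired = null;
    for (HRegionInfo info : regions.keySet()) {
      if (isEntirelyExpired(info.getStartKey(), info.getEndKey())) {
        expired = info;
        break;
      }
    }
    if (expired == null) {
      return; // nothing to drop
    }

    // 3. Disable the table so no region server serves the region while we
    //    remove it. (This is the short outage mentioned in the thread.)
    admin.disableTable("sessions");

    // 3a. Remove the region's directory under hbase.rootdir on HDFS
    //     (e.g. with hadoop fs -rmr); the exact layout is version dependent,
    //     so it is left out of this sketch.

    // 3b. Remove the region's row from .META. so clients no longer see it.
    //     The .META. row key for a region is its region name. Note this leaves
    //     a hole in the table's key range, which relates to my first question.
    HTable meta = new HTable(conf, ".META.");
    meta.delete(new Delete(expired.getRegionName()));

    admin.enableTable("sessions");
  }

  // Placeholder expiry test: in reality this would parse the customer id and
  // timestamp out of the start/end keys and compare against the per-customer
  // retention policy.
  private static boolean isEntirelyExpired(byte[] startKey, byte[] endKey) {
    return endKey.length > 0
        && Bytes.toString(endKey).compareTo("customer42-1301616000000") <= 0;
  }
}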

I'm going to run some tests on this tomorrow, so any comments would be helpful.
I'll keep you updated with the results.
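
For reference, here is how I read Ted's suggested "date - customer id - time
stamp - session id" key. It's only an untested illustration; the fixed-width
long encoding and the example values are my own assumptions, but the point is
that one day's data becomes a single contiguous key range that can be dropped
as a unit:

import org.apache.hadoop.hbase.util.Bytes;

public class SessionRowKey {

  // Fixed-width big-endian longs keep lexicographic order equal to numeric
  // order (for non-negative values), so rows sort by day, then customer,
  // then event time, then session.
  public static byte[] build(long dayNumber, long customerId,
                             long eventTimestamp, long sessionId) {
    return Bytes.add(
        Bytes.add(Bytes.toBytes(dayNumber), Bytes.toBytes(customerId)),
        Bytes.add(Bytes.toBytes(eventTimestamp), Bytes.toBytes(sessionId)));
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    long dayNumber = now / (24L * 60 * 60 * 1000); // days since the epoch
    byte[] key = build(dayNumber, 42L, now, 12345L);
    System.out.println(Bytes.toStringBinary(key));
  }
}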

Thanks again,
Ophir


On Mon, May 9, 2011 at 8:34 PM, Ted Dunning <tdunning@maprtech.com> wrote:

> If you change your key to "date - customer id - time stamp - session id"
> then you shouldn't lose any important
> data locality, but you would be able to delete things more efficiently.
>
> For one thing, any map-reduce jobs doing the deletes would be doing dense
> scans over a small part of your data. That might make them run much faster.
>
> For another, you should be able to do the region switch trick and then drop
> entire regions. That has the unfortunate
> side-effect of requiring that you disable the table for a short period (I
> think).
>
> On Mon, May 9, 2011 at 10:09 AM, Ophir Cohen <ophchu@gmail.com> wrote:
>
> > Thanks for the answer!
> >
> > A little bit more info:
> > Our data consists of internal events grouped into sessions (i.e. groups of events).
> > There are different sessions for different customers.
> > We're talking about millions of sessions per day.
> >
> > The key is *customer id - time stamp - session id*.
> > So yes, it is sorted by customer and date, and as I want to remove rows by
> > customer and date, the sort order fits.
> > Actually, the main motivation for removing old rows is that we have storage
> > limitations (and too much data...).
> >
> > So my question is whether we can do something better than a nightly/weekly
> > map/reduce job that ends with a major compaction.
> > Ophir
> > PS
> > The majority of my customers share the same retention policy, but I still
> > need the ability to change it for a specific customer.
> >
> >
> > On Mon, May 9, 2011 at 6:48 PM, Ted Dunning <tdunning@maprtech.com> wrote:
> >
> > > Can you say a bit more about your data organization?
> > >
> > > Are you storing transactions of some kind? If so, can your key involve
> > > time? I think that putting some extract of time (a day number, perhaps)
> > > as a leading part of the key would help.
> > >
> > > Are you storing profiles where the key is the user (or something) id and
> > > the data is essentially a list of transactions? If so, can you segregate
> > > transactions into separate column families that can be dropped as data
> > > expires?
> > >
> > > When you say data expiration varies by customer, is that really necessary,
> > > or can you have a lowest common denominator for actual deletions, with
> > > rules that govern how much data is actually visible to the consumer of
> > > the data?
> > >
> > > On Mon, May 9, 2011 at 2:59 AM, Ophir Cohen <ophchu@gmail.com> wrote:
> > >
> > > > Hi All,
> > > > At my company we are currently working hard on deploying our cluster
> > > > with HBase.
> > > >
> > > > We're talking about ~20 nodes holding pretty big data (~1TB per day).
> > > >
> > > > As there is a lot of data, we need a retention mechanism, i.e. a way to
> > > > remove old data.
> > > >
> > > > The problem is that I can't/don't want to do it using TTL, for two
> > > > reasons:
> > > >
> > > > 1. Different retention policies for different customers.
> > > > 2. The policy might change.
> > > >
> > > >
> > > > Of course, I can do it using a nightly (weekly?) MR job that runs over
> > > > all the data and removes the old data.
> > > > There are a few problems:
> > > >
> > > > 1. Running over a huge amount of data only to remove a small portion of it.
> > > > 2. It will be a heavy MR job.
> > > > 3. Need to perform a major compaction afterwards - that will affect
> > > > performance or even stop service (is that right???).
> > > >
> > > > I might use BulkFileOutputFormat for that job - but it would still have
> > > > those problems.
> > > >
> > > > As my data is sorted by the retention dimensions (customer and time), I
> > > > thought of this option:
> > > >
> > > > 1. Split regions so that one region contains the 'candidates for removal'.
> > > > 2. Drop this region.
> > > >
> > > >
> > > > - Is it possible to drop a region?
> > > > - Do you think it's a good idea?
> > > > - Any other ideas?
> > > >
> > > > Thanks,
> > > >
> > > > Ophir Cohen
> > > > LivePerson
> > > >
> > >
> >
>
