hbase-user mailing list archives

From Ophir Cohen <oph...@gmail.com>
Subject Re: Data retention in HBase
Date Wed, 11 May 2011 13:14:00 GMT
My results from today's research:

I tried to delete a region as Stack suggested:

   1. *close_region*
   2. Remove the region's files from the file system.
   3. *assign* the region again.

It looks like it works!
The region still exists, but it's empty.
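
For the record, this is roughly the same sequence driven from the Java client
instead of the shell. It is only a sketch: the region/server names are made up,
it assumes the default /hbase root directory and a <rootdir>/<table>/<encoded
region name> layout, and the exact HBaseAdmin signatures vary a bit between
versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class DropRegionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Made-up names: substitute the real region name, its encoded name
    // and the region server currently hosting it.
    String regionName = "mytable,splitkey,1305118440000.abcdef1234567890.";
    String encodedName = "abcdef1234567890";
    String hostingServer = "regionserver1.example.com:60020";

    // 1. Close the region so no server is serving it.
    admin.closeRegion(regionName, hostingServer);

    // 2. Remove the region's directory from HDFS.
    FileSystem fs = FileSystem.get(conf);
    fs.delete(new Path("/hbase/mytable/" + encodedName), true);

    // 3. Ask the master to assign the now-empty region again.
    admin.assign(Bytes.toBytes(regionName), true);
  }
}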

Looks good, but it's definitely not the end of the road.
In order to finalize this solution I still have these questions:


   1. Can I split a region at a specific key? It looks like it splits
   automatically (see the sketch right after this list).
   2. It seems that splitting from the command line does not work... I get the
   message in the log but nothing really happens. Actually, the code states that
   it triggers a compaction and that this should be enough (????).
   3. Is there a way to plug in my own method of region splitting? I think it
   could be a great option - a way to state when and how a region is split...
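
On question 1, something like this is what I have in mind - assuming a client
whose HBaseAdmin offers split() with an explicit split point (older clients only
take a table/region name and let the region pick its own midpoint). The table
name and split key here are just placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitAtKeySketch {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

    // Split the region containing this key exactly at this key, so that
    // everything before it ends up in its own region.
    byte[] splitPoint = Bytes.toBytes("customer42|2011-04-01");
    admin.split(Bytes.toBytes("mytable"), splitPoint);

    // A major compaction afterwards rewrites the store files so each
    // daughter region only holds its own half of the data.
    admin.majorCompact("mytable");
  }
}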

Any thoughts?
Thanks,
Ophir

On Tue, May 10, 2011 at 6:50 PM, Ophir Cohen <ophchu@gmail.com> wrote:

> OK, so to summarize the discussion (and raise some more problems), here is
> what I gathered:
>
> I have two options:
>
> 1. I can run a map/reduce job on the rows I want to delete.
> The main problem here: after each job I need to run a major compaction, which
> will stop service at compaction time.
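
To be concrete, the MR job I have in mind looks roughly like this - only a
sketch: the table name and the "customerId|timestamp|sessionId" key layout used
to bound the scan are placeholders for ours.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class RetentionDeleteJob {

  static class DeleteMapper extends TableMapper<ImmutableBytesWritable, Delete> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // Every row the bounded scan hands us is already expired,
      // so just emit a Delete for it.
      context.write(row, new Delete(row.get()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "retention-delete");
    job.setJarByClass(RetentionDeleteJob.class);

    // Bound the scan to one customer's expired range instead of the whole table.
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("customer42|"));
    scan.setStopRow(Bytes.toBytes("customer42|1301616000000"));
    scan.setCaching(500);
    scan.setCacheBlocks(false); // don't pollute the block cache with a one-off scan

    TableMapReduceUtil.initTableMapperJob("mytable", scan, DeleteMapper.class,
        ImmutableBytesWritable.class, Delete.class, job);
    // Write the Deletes straight back to the same table; no reducer needed.
    TableMapReduceUtil.initTableReducerJob("mytable", null, job);
    job.setNumReduceTasks(0);

    job.waitForCompletion(true);
  }
}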
>
> Question:
>
>    - Why should a major compaction stop service (BTW I'm mainly concerned
>    about insertions; a read denial of service I can live with)?
>
> 2. Split out a specific region and delete that region.
>
> Questions here:
>
>    - How is the META table updated after I close the region and remove
>    the files? Should I remove the region from the META table as well?
>    - Why do I need to disable the table? For how long, do you think, do I
>    need to disable it? Can I bypass that?
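
On the META question, my working assumption is that "removing it" would mean
deleting the region's row from the .META. table by hand, roughly as below. The
region name is made up, and I haven't verified yet that this step is actually
required:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class RemoveMetaRowSketch {
  public static void main(String[] args) throws Exception {
    // .META. rows are keyed by the full region name.
    HTable meta = new HTable(HBaseConfiguration.create(), ".META.");
    String regionName = "mytable,splitkey,1305118440000.abcdef1234567890.";
    meta.delete(new Delete(Bytes.toBytes(regionName)));
    meta.close();
  }
}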
>
> I'm going to run some tests on this tomorrow, so any comments will be
> helpful.
> I'll keep you updated with the results.
>
> Thanks again,
> Ophir
>
>
> On Mon, May 9, 2011 at 8:34 PM, Ted Dunning <tdunning@maprtech.com> wrote:
>
>> If you change your key to "date - customer id - time stamp - session id"
>> then you shouldn't lose any important
>> data locality, but you would be able to delete things more efficiently.
>>
>> For one thing, any map-reduce programs that are running for deleting would
>> be doing dense scans over a small
>> part of your data. That might make them run much faster.
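
(Inline note: to make the suggested key order concrete, the sketch below is how
I read it - the "|" separator and the field formatting are my own placeholders,
not anything Ted specified.)

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeySketch {

  // Row key ordered as date | customer id | timestamp | session id, so all
  // rows for one day are contiguous and can be scanned (or dropped) together.
  static byte[] rowKey(String day, String customerId, long ts, String sessionId) {
    return Bytes.toBytes(day + "|" + customerId + "|" + ts + "|" + sessionId);
  }

  // A deletion scan for one expired day then only touches that day's slice.
  static Scan expiredDayScan(String day) {
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes(day + "|"));
    scan.setStopRow(Bytes.toBytes(day + "}")); // '}' sorts right after '|'
    return scan;
  }
}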
>>
>> For another, you should be able to do the region switch trick and then
>> drop
>> entire regions. That has the unfortunate
>> side-effect of requiring that you disable the table for a short period (I
>> think).
>>
>> On Mon, May 9, 2011 at 10:09 AM, Ophir Cohen <ophchu@gmail.com> wrote:
>>
>> > Thanks for the answer!
>> >
>> > A little bit more info:
>> > Our data is internal events grouped into sessions (i.e. groups of events).
>> > There are different sessions for different customers.
>> > We're talking about millions of sessions per day.
>> >
>> > The key is *customer id - time stamp - session id*.
>> > So yes, it's sorted by customer and date, and as I want to remove rows by
>> > customer and date - it's sorted all right.
>> > Actually the main motivation to remove old rows is that we have storage
>> > limitations (and too much data...).
>> >
>> > So, my concern is whether we can do something better than a nightly/weekly
>> > map reduce job that ends up with a major compaction.
>> > Ophir
>> > PS
>> > The majority of my customers share the same retention policy, but I still
>> > need the ability to change it for a specific customer.
>> >
>> >
>> > On Mon, May 9, 2011 at 6:48 PM, Ted Dunning <tdunning@maprtech.com> wrote:
>> >
>> > > Can you say a bit more about your data organization?
>> > >
>> > > Are you storing transactions of some kind? If so, can your key involve
>> > > time? I think that putting some extract of time (day number, perhaps) as a
>> > > leading part of the key would help.
>> > >
>> > > Are you storing profiles where the key is the user (or something) id and
>> > > the data is essentially a list of transactions? If so, can you segregate
>> > > transactions into separate column families that can be dropped as data
>> > > expires?
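
(Inline note: to spell out that column-family variant - the table name and the
per-month family name below are hypothetical, and I'm not sure yet whether many
families would hurt us elsewhere.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class DropExpiredFamilySketch {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

    // Dropping a whole column family removes its store files outright,
    // but the table has to be disabled while the schema changes.
    admin.disableTable("sessions");
    admin.deleteColumn("sessions", "tx_2011_04");
    admin.enableTable("sessions");
  }
}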
>> > >
>> > > When you say data expiration varies by customer, is that really necessary,
>> > > or can you have a lowest common denominator for actual deletions, with rules
>> > > that govern how much data is actually visible to the consumer of the data?
>> > >
>> > > On Mon, May 9, 2011 at 2:59 AM, Ophir Cohen <ophchu@gmail.com> wrote:
>> > >
>> > > > Hi All,
>> > > > In my company we are currently working hard on deploying our cluster
>> > > > with HBase.
>> > > >
>> > > > We're talking about ~20 nodes to hold pretty big data (~1TB per day).
>> > > >
>> > > > As there is a lot of data, we need a retention method, i.e. a way to
>> > > > remove old data.
>> > > >
>> > > > The problem is that I can't/don't want to do it using TTL, for two
>> > > > reasons:
>> > > >
>> > > > 1. Different retention policies for different customers.
>> > > > 2. The policy might change.
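
(Inline note for context: TTL is a per-column-family setting declared at schema
time, roughly as below with hypothetical table/family names - one value for the
whole family, which is exactly why it can't express per-customer policies.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TtlSketch {
  public static void main(String[] args) throws Exception {
    // The TTL applies to every cell in the family, regardless of customer.
    HColumnDescriptor family = new HColumnDescriptor("events");
    family.setTimeToLive(90 * 24 * 60 * 60); // 90 days, in seconds

    HTableDescriptor table = new HTableDescriptor("sessions");
    table.addFamily(family);

    new HBaseAdmin(HBaseConfiguration.create()).createTable(table);
  }
}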
>> > > >
>> > > >
>> > > > Of course, I can do it using a nightly (weekly?) MR job that runs over
>> > > > all the data and removes the old data.
>> > > > There are a few problems:
>> > > >
>> > > > 1. Running over a huge amount of data only to remove a small portion of it.
>> > > > 2. It'll be a heavy MR job.
>> > > > 3. Need to perform a major compaction afterwards - that will affect
>> > > > performance or even stop service (is that right???).
>> > > >
>> > > > I might use BulkFileOutputFormat for that job - but I'd still have those
>> > > > problems.
>> > > >
>> > > > As my data is sorted by the retention policies (customer and time), I
>> > > > thought of this option:
>> > > >
>> > > > 1. Split regions so as to create a region with the 'candidates to be removed'.
>> > > > 2. Drop this region.
>> > > >
>> > > >
>> > > > - Is it possible to drop a region?
>> > > > - Do you think it's a good idea?
>> > > > - Any other ideas?
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Ophir Cohen
>> > > > LivePerson
>> > > >
>> > >
>> >
>>
>
>
