hbase-user mailing list archives

From Alex Baranau <alex.barano...@gmail.com>
Subject Re: Hbase Data Model to purge old data.
Date Thu, 26 Jul 2012 22:55:31 GMT
Very nice presentation. Awesome simulation tool!

Couldn't help but leave a comment. Or two.

1. It is even possible to set the qualifier name to an empty byte[]. This
might help you save some extra byte(s) ;)

2. It looks like after several days you have a lot of data in memstores
that is not frequently accessed, i.e. the memstores of the regions that
hold data several days old or older. It would be great to use this valuable
main memory for storing frequently accessed data. Quick thoughts:
* periodically perform a manual flush of the older regions' memstores; this
will free that memory, which you can then use:
  ** for bigger memstores (I believe that should especially improve your
timings for fetching data older than an hour (there's a bit of a spike on the
fetch time chart there))
  ** for bigger block caches
  ** for having more "hot" regions per RS
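
To put a rough number on point 1: below is a minimal sketch that estimates the uncompressed size of one stored cell with and without a qualifier. It assumes the classic HBase KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type), one cell per record, and no compression or block encoding. The row key, family name "d", qualifier "cdr" and 100-byte value are made up for illustration; only the 300 million records/day figure comes from the thread.

```python
# Fixed per-cell bytes in the assumed KeyValue layout:
# key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1)
KEYVALUE_FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def keyvalue_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Estimated uncompressed size in bytes of a single cell."""
    return (KEYVALUE_FIXED_OVERHEAD + len(row) + len(family)
            + len(qualifier) + len(value))


def daily_savings_bytes(qualifier: bytes, cells_per_day: int) -> int:
    """Bytes saved per day by replacing `qualifier` with an empty one."""
    return len(qualifier) * cells_per_day


# Hypothetical cell: names and sizes are illustrative only.
row = b"subscriber-0001"
value = b"x" * 100
with_qualifier = keyvalue_size(row, b"d", b"cdr", value)
empty_qualifier = keyvalue_size(row, b"d", b"", value)

print(with_qualifier - empty_qualifier, "bytes saved per cell")
print(daily_savings_bytes(b"cdr", 300_000_000), "bytes saved per day")
```

The saving is exactly the qualifier length per cell, so it only becomes noticeable at volumes like the one discussed here, and compression on disk will shrink it further.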

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

P.S. Any chance of converting the first video of the simulation tool to a
GIF or something similar and allowing it to be used for teaching? ;)

P.S.-2 Have you tried connecting it to a real cluster already? I know we
are all busy, but I still hope you'll find the time. Btw, I believe it will
soon be easier to integrate, as HBase metrics are getting a lot of
attention. They should be much more usable soon.

On Thu, Jul 26, 2012 at 1:06 PM, Cristofer Weber <cristofer.weber@neogrid.com> wrote:

> Hi there
>
> There are some really good ideas in this presentation from HBaseCon:
> http://www.cloudera.com/resource/video-hbasecon-2012-real-performance-gains-with-real-time-data/
>
> Regards,
> Cristofer
>
> -----Original Message-----
> From: Alex Baranau [mailto:alex.baranov.v@gmail.com]
> Sent: Thursday, July 26, 2012 11:28
> To: user@hbase.apache.org
> Subject: Re: Hbase Data Model to purge old data.
>
> > reason for this is bulk delete of one days data within a big table is
> > more expensive than dropping a one day table
>
> Sorry for the obvious question, but have you tried using TTLs instead of
> deleting rows explicitly? This should put less load on the cluster,
> though you'll still have to run major_compaction, which might be a
> resource-intensive process.
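
For the 90-day retention example from the original question, the TTL arithmetic can be sketched as below. This is a simplified model, not HBase's actual code path: it assumes the TTL is configured in seconds per column family and that cell timestamps are in milliseconds, and the exact boundary comparison used by HBase may differ. The names RETENTION_DAYS, TTL_SECONDS and is_expired are made up for illustration.

```python
RETENTION_DAYS = 90  # retention period from the example in the thread

# TTL in seconds for the assumed 90-day retention
TTL_SECONDS = RETENTION_DAYS * 24 * 60 * 60


def is_expired(cell_ts_ms: int, now_ms: int, ttl_seconds: int = TTL_SECONDS) -> bool:
    """Simplified model: a cell is expired once its age exceeds the TTL."""
    return now_ms - cell_ts_ms > ttl_seconds * 1000


print(TTL_SECONDS)                       # TTL value for 90 days, in seconds
print(is_expired(0, 91 * 86_400_000))    # a 91-day-old cell
print(is_expired(0, 89 * 86_400_000))    # an 89-day-old cell
```

Note that expired cells simply disappear, so under this approach the archive-to-tape step mentioned in the question would have to run before the data reaches the TTL.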
>
> > In this per-day-separate-table model, the load balancer will never get
> > triggered as the current days table is always in memory, and daughter
> > regions will continuously get assigned to same region server. This leads
> > to a region server hotspots.
>
> Again, maybe an obvious question: have you tried (or is it possible in
> your case) to pre-split the table so that regions are distributed over the
> cluster from the start?
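
A minimal sketch of how split points for such a pre-split could be computed. It assumes row keys are zero-padded decimal subscriber ids of fixed width that are spread roughly uniformly over the id space; neither assumption is stated in the thread, and the function names are made up for illustration. In HBase the resulting keys would be handed over as the split keys when creating each day's table.

```python
from bisect import bisect_right
from typing import List


def split_keys(num_regions: int, id_space: int, width: int) -> List[bytes]:
    """Return num_regions - 1 evenly spaced split points over [0, id_space)."""
    if num_regions < 2:
        return []
    return [
        str(i * id_space // num_regions).zfill(width).encode("ascii")
        for i in range(1, num_regions)
    ]


def region_index(row_key: bytes, keys: List[bytes]) -> int:
    """Index of the region a row falls into; a split key starts a new region."""
    return bisect_right(keys, row_key)


# Hypothetical setup: 10-digit subscriber ids spread over 4 regions.
keys = split_keys(4, 10**10, 10)
print(keys)
print(region_index(b"0000000001", keys))
print(region_index(b"9999999999", keys))
```

If the real ids are skewed (for example, a shared prefix per operator), evenly spaced points will not balance the load, and the split points would need to come from a sample of actual keys instead.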
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
> On Thu, Jul 26, 2012 at 2:34 AM, Padmanaban <padmanaban.mathulu@gmail.com> wrote:
>
> > We have the following use case:
> >
> > Store telecom CDR data on a per-subscriber basis. The data is time series
> > based, every record is per-subscriber, and records come in round the
> > clock. The expected volume of data is around 300 million records/day.
> > This data is to be queried 24/7 by an online system, where the filters
> > are subscriber id and date range.
> >
> > Since the volume of data is huge, we have data retention policies to
> > archive old data on a daily basis.
> > For example, if retention is set to 90 days, every day an offline
> > process would delete data from HBase which is older than 90 days and
> > archive it on tape.
> >
> > The current HBase data model design is as follows:
> > A separate table for every day's data, with the subscriber id as the row
> > key. The reason for this is that a bulk delete of one day's data within a
> > big table is more expensive than dropping a one-day table. In this
> > per-day-separate-table model, the load balancer never gets triggered, as
> > the current day's table is always in memory, and daughter regions
> > continuously get assigned to the same region server. This leads to
> > region server hotspots.
> >
> > Please give feedback on whether the per-day-separate-table model is the
> > best practice for this use case, considering the data life cycle
> > management requirement. If yes, how do we solve the side effect of
> > region server hotspots? If no, please advise an alternate model.
> >
> > Thanks in advance,
> > Padmanaban M
> >
> >
> >
>
>
> --
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>



