hbase-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: when to use hive vs hbase
Date Wed, 30 Apr 2014 13:02:34 GMT
Hi Shushant,

Have you looked at OpenTSDB? If you use a timestamp at the start of your
rowkey you will create what we call hotspots, and you want to avoid that.
OpenTSDB might help you with that.

The key you propose will create a hotspot with a default HBase setup, and you
want to avoid that. You can place the request ID first, but then you cannot
really scan by time anymore. You can salt the key with a value between 0 and 9
in front of it, but then you will need to do 10 scans instead of 1. So take a
quick look at OpenTSDB (it uses HBase) and see if it helps your use case.
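The salting idea can be sketched as follows. This is a minimal sketch, not from the thread: the bucket count of 10, the key layout, and the class/method names are all assumptions for illustration.

```java
// Sketch of a salted rowkey: prefix the time-ordered key with a
// deterministic bucket derived from the request ID, so consecutive
// timestamps spread across 10 key ranges instead of one.
public class SaltedKey {
    static final int BUCKETS = 10; // assumption: salt values 0..9 as in the thread

    // The bucket must be deterministic so readers can recompute it.
    static int bucket(String reqId) {
        return Math.abs(reqId.hashCode() % BUCKETS);
    }

    static String rowKey(String timestamp, String reqId) {
        return bucket(reqId) + "|" + timestamp + "_" + reqId;
    }

    public static void main(String[] args) {
        // The cost: a time-range read must now fan out, one scan per
        // bucket, each over [b + "|" + start, b + "|" + stop).
        for (int b = 0; b < BUCKETS; b++) {
            String startRow = b + "|" + "2014-04-29-00-00-00";
            String stopRow  = b + "|" + "2014-04-30-00-00-00";
            System.out.println("scan " + startRow + " .. " + stopRow);
        }
    }
}
```

This is the trade-off mentioned above in code form: writes are spread over 10 buckets, but every time-range read becomes 10 scans.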

JM


2014-04-30 8:39 GMT-04:00 Shushant Arora <shushantarora09@gmail.com>:

> Thanks Jean!
>
> Few more questions
> what are good practices for key column design in HBase?
> Say my web logs contain a timestamp and a request ID which uniquely identify
> each row.
>
> 1. Shall I make YYYY-MM-DD-HH-MM-SS_REQ_ID the row key? In this scenario the
> data will be fetched from HBase daily and loaded into a
> MySQL DB.
> My ETL runs daily and fetches records with keycol >= lastdate and
> keycol <= today. Will this key design overload one region server, or will it
> be divided equally among the region servers?
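The daily ETL scan described in the question can be expressed as a start/stop row pair. Because every key for a given day shares the same leading timestamp prefix, the whole range is contiguous in HBase's sorted key order, which is exactly what makes it land on a single region. A sketch, using the key format proposed above (the class and method names are illustrative):

```java
// Sketch: with YYYY-MM-DD-HH-MM-SS_REQ_ID keys, one day's data forms a
// single contiguous key range, so the daily scan (and all the writes
// arriving that day) hit one region rather than being spread out.
public class DailyRange {
    static String startRow(String day) {      // day as YYYY-MM-DD, inclusive
        return day + "-00-00-00";
    }

    static String stopRow(String nextDay) {   // exclusive upper bound
        return nextDay + "-00-00-00";
    }

    public static void main(String[] args) {
        String start = startRow("2014-04-29");
        String stop  = stopRow("2014-04-30");
        // Every key in [start, stop) shares the "2014-04-29" prefix.
        System.out.println("scan [" + start + ", " + stop + ")");
    }
}
```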
>
> On Wed, Apr 30, 2014 at 5:55 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
> > With HBase you have some overhead. The RegionServer does a lot for you:
> > managing all the column families, the columns, the delete markers, the
> > compactions, etc. If you read a file directly from HDFS it will be faster
> > for sure, because you will not have all those validations and all this
> > extra memory usage.
> >
> > HBase is absolutely perfect and excellent at what it's built for. But if
> > you are doing only full table scans, that's not its primary use case. It
> > can still do it if you want, but if you do only that, it's not the most
> > efficient option.
> >
> > If your use case is a mix of full scans and random reads/random writes,
> > then yes, go with it!
> >
> > Last, some full table scans can be a good fit for HBase if you use some of
> > its specific features, like TTL on certain column families when using more
> > than one, etc.
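The TTL point can be illustrated with the cell-expiry rule HBase applies per column family: a cell whose timestamp is older than the family's TTL is filtered out of reads and removed at compaction, so a full scan only touches live data. A minimal sketch of that rule; the class is hypothetical and is not part of the HBase client API:

```java
// Illustrative sketch of per-column-family TTL semantics: a cell whose
// timestamp is older than (now - ttlSeconds) is treated as expired.
// This mirrors the rule HBase applies for a family created with a TTL;
// the class itself is a hypothetical stand-in, not HBase API.
public class TtlRule {
    static boolean isExpired(long cellTsMillis, long ttlSeconds, long nowMillis) {
        return nowMillis - cellTsMillis > ttlSeconds * 1000L;
    }

    public static void main(String[] args) {
        long now = 1_000_000_000L;   // arbitrary "current" time in millis
        long ttl = 86_400;           // one day, in seconds
        // A cell written two days ago is expired; one written an hour ago is not.
        System.out.println(isExpired(now - 2 * 86_400_000L, ttl, now));
        System.out.println(isExpired(now - 3_600_000L, ttl, now));
    }
}
```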
> >
> > HTH
> >
> >
> > 2014-04-30 8:13 GMT-04:00 Shushant Arora <shushantarora09@gmail.com>:
> >
> > > Hi Jean
> > >
> > > Thanks for explanation .
> > >
> > > I still have one doubt:
> > > why is HBase not good for bulk loads and aggregations
> > > (full table scans)? Hive will also read each row for aggregation, just
> > > as HBase would.
> > > Can you explain more?
> > >
> > >
> > > On Wed, Apr 30, 2014 at 5:15 PM, Jean-Marc Spaggiari <
> > > jean-marc@spaggiari.org> wrote:
> > >
> > > > Hi Shushant,
> > > >
> > > > Hive and HBase are 2 different things. You cannot really use one vs
> > > > the other.
> > > >
> > > > Hive is a query engine against HDFS data. The data can be stored in
> > > > different formats like flat text, sequence files, Parquet files, or
> > > > even HBase tables.
> > > > HBase is both a query engine (Gets and Scans) and a storage engine on
> > > > top of HDFS which allows you to store data for random reads and random
> > > > writes.
> > > >
> > > > Then you can also add tools like Phoenix and Impala to the picture,
> > > > which will allow you to query the data from HDFS or HBase too.
> > > >
> > > > A good way to know if HBase is a good fit or not is to ask yourself
> > > > how you are going to write into HBase and read from HBase. HBase is
> > > > good for random reads and random writes. If you only do bulk loads and
> > > > aggregations (full table scans), HBase is not a good fit. If you do
> > > > random access (client information, event details, etc.), HBase is a
> > > > good fit.
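The Get-vs-full-scan distinction above can be mimicked with a sorted map, which is roughly how HBase lays out rows by key. This is a toy model, not the HBase client API: a Get is a point lookup by rowkey, while an aggregation must walk every row, which is where Hive-style engines are the better fit.

```java
import java.util.TreeMap;

// Toy model of an HBase table as a sorted map of rowkey -> value.
// A "Get" is a cheap point lookup; a full-table aggregation touches
// every row, so its cost grows with the table no matter the engine.
public class GetVsScan {
    static final TreeMap<String, Long> table = new TreeMap<>();

    static Long get(String rowKey) {   // random read: one keyed lookup
        return table.get(rowKey);
    }

    static long sumAll() {             // full scan: every row is read
        long sum = 0;
        for (long v : table.values()) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        table.put("client-001", 10L);
        table.put("client-002", 20L);
        table.put("client-003", 30L);
        System.out.println(get("client-002"));
        System.out.println(sumAll());
    }
}
```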
> > > >
> > > > It's a bit oversimplified, but that should give you some starting
> > > > points.
> > > >
> > > >
> > > > 2014-04-30 4:34 GMT-04:00 Shushant Arora <shushantarora09@gmail.com
> >:
> > > >
> > > > > I have a requirement to process huge weblogs on a daily basis.
> > > > >
> > > > > 1. Data will come into the datastore incrementally on a daily basis,
> > > > > and I need cumulative and daily
> > > > > distinct user counts from the logs; after that the aggregated data
> > > > > will be loaded
> > > > > into an RDBMS like MySQL.
> > > > >
> > > > > 2. Data will be loaded into the HDFS data warehouse on a daily
> > > > > basis, and the same will be
> > > > > fetched from the HDFS warehouse, after some filtering, into an RDBMS
> > > > > like MySQL, and
> > > > > will be processed there.
> > > > >
> > > > > Which data warehouse is suitable for approaches 1 and 2, and why?
> > > > >
> > > > > Thanks
> > > > > Shushant
> > > > >
> > > >
> > >
> >
>
