hbase-user mailing list archives

From Prosperent <prospere...@gmail.com>
Subject RE: hbase architecture question
Date Tue, 12 Apr 2011 16:38:59 GMT

The plan was to have the MapReduce jobs run on our schedule (hourly, daily,
monthly) and populate these rollups so we don't have to do any processing
on the data in HBase. When a user requests stats, we just pull back the
already-compiled data from the rollups. It isn't realtime this way, but we
avoid the I/O issues that you pointed out.
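To make the serving side of that plan concrete, here is a minimal sketch of the "read precomputed rollups" pattern. The row-key scheme, column names, and figures are all illustrative (nothing here comes from the thread), and a plain dict stands in for the HBase rollup table the scheduled MR job would populate:

```python
# Sketch only: hypothetical rollup key scheme, in-memory stand-in for HBase.

def rollup_key(channel, granularity, period):
    """Row key for a precomputed rollup, e.g. 'chan42|daily|2011-04-12'."""
    return f"{channel}|{granularity}|{period}"

# Stand-in for the rollup table an hourly/daily MR job would write.
rollup_table = {
    rollup_key("chan42", "daily", "2011-04-12"): {"clicks": 1830, "revenue": 97.40},
}

def get_stats(channel, granularity, period):
    # Serving path: a single point lookup, no per-request scan of raw events.
    return rollup_table.get(rollup_key(channel, granularity, period))

print(get_stats("chan42", "daily", "2011-04-12"))
```

The point of the design is visible in `get_stats`: each front-end request costs one get against a precomputed row, so read load stays flat regardless of how much raw data sits behind it.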

Lyman Do wrote:
> It depends on how many concurrent users are on the BI front end. If each of
> them fires off an MR job for their BI queries, which will likely result in
> a scan or partial scan on HBase, this may put too much stress on the I/O
> subsystem.
> If you have the data access pattern of your BI users, you may want to
> pre-aggregate some of the data into MySQL in the form of a data mart, which
> is more flexible for slice-and-dice queries. Leave the MR jobs for ad hoc
> and non- or semi-aggregated data analysis.
> -----Original Message-----
> From: Prosperent [mailto:prosperent1@gmail.com] 
> Sent: Monday, April 11, 2011 3:10 PM
> To: hbase-user@hadoop.apache.org
> Subject: hbase architecture question
> We're new to HBase, but somewhat familiar with the core concepts associated
> with it. We use MySQL now, but have also used Cassandra for portions of our
> code. We feel that HBase is a better fit because of its tight integration
> with MapReduce and the proven stability of the underlying Hadoop system.
> We run an advertising network in which we collect several thousand pieces
> of analytical data per second. This obviously scales poorly in MySQL. Our
> initial gut feeling is to do something like the following with HBase. Let
> me know if we are on the right track.
> Aggregate our detailed raw stats into HBase tables that contain all of our
> verbose data. From there, we can run MapReduce jobs to create hourly,
> daily, monthly, etc. rollups of our data as needed for our different
> front-end interfaces, storing the results in the format we need at display
> time so we don't have to do any further processing. This would also give
> us the flexibility to create new views with new rollup metrics, since we
> have all of our raw data stored and can MapReduce it again any way we
> need.
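The rollup step described in the quoted plan can be sketched in a few lines. This is a pure-Python stand-in for what the Hadoop job would do over the raw table (the event tuples and channel names are made up for illustration); the real job would emit these counts as puts into the rollup tables:

```python
from collections import defaultdict
from datetime import datetime

# Illustrative raw events: (timestamp, channel, event type).
raw_events = [
    ("2011-04-12 16:05:00", "chan42", "click"),
    ("2011-04-12 16:40:00", "chan42", "click"),
    ("2011-04-12 17:10:00", "chan42", "impression"),
]

def rollup(events, fmt):
    """Aggregate event counts into time buckets named by a strftime pattern."""
    counts = defaultdict(int)
    for ts, channel, kind in events:
        bucket = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime(fmt)
        counts[(channel, bucket, kind)] += 1
    return dict(counts)

hourly = rollup(raw_events, "%Y-%m-%d %H")   # hourly rollup
daily = rollup(raw_events, "%Y-%m-%d")       # daily rollup
```

Because the raw events are kept, a new granularity (weekly, per-campaign) is just another `fmt` or key function, which is the flexibility the plan above is after.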
> For simple graphs and a more realtime view of simple data like clicks and
> impressions, we thought about simply auto-incrementing hourly, daily, and
> monthly counters for a user or revenue channel.
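HBase does support atomic counters natively (`incrementColumnValue` / `Increment` in the Java client), which fits the realtime counter idea above. The sketch below uses an in-memory dict in place of a live table, and the user/period key scheme is an assumption, not something specified in the thread:

```python
from collections import defaultdict

# In-memory stand-in for HBase atomic counters; keys mimic a row/column
# scheme of (user, granularity, period, metric).
counters = defaultdict(int)

def record_click(user, ts):
    """Bump hourly, daily, and monthly counters; ts is 'YYYY-MM-DD HH'."""
    day = ts.split(" ")[0]
    month = day[:7]
    for granularity, period in (("hourly", ts), ("daily", day), ("monthly", month)):
        # In HBase this would be one atomic increment per granularity.
        counters[(user, granularity, period, "clicks")] += 1

record_click("user7", "2011-04-12 16")
record_click("user7", "2011-04-12 17")
```

Each click costs a fixed, small number of increments, so the realtime view stays cheap to maintain alongside the batch rollups.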
> The other consideration is getting the data into HBase. We were looking at
> adding variables to our URLs so we can aggregate the Apache logs from each
> of our front-end application servers. Alternatively, we could simply do
> the inserts straight into HBase using PHP and Thrift. I'm guessing the
> first scenario is more efficient speed-wise, but again, I may be
> overlooking other issues.
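The log-aggregation route above amounts to tagging requests with query-string variables and batch-parsing the access logs before loading HBase. A minimal sketch, where the log line, pixel path, and parameter names (`ch`, `ev`) are all invented for illustration:

```python
import re
from urllib.parse import parse_qs, urlparse

# Pull the request path out of a combined-format Apache log line.
LOG_RE = re.compile(r'"GET (?P<path>\S+) HTTP')

def parse_line(line):
    """Return the tracking variables embedded in a logged request URL."""
    m = LOG_RE.search(line)
    if not m:
        return None
    qs = parse_qs(urlparse(m.group("path")).query)
    return {k: v[0] for k, v in qs.items()}

line = ('1.2.3.4 - - [12/Apr/2011:16:38:59 +0000] '
        '"GET /px.gif?ch=chan42&ev=click HTTP/1.1" 200 43')
print(parse_line(line))
```

The trade-off the poster is weighing: log parsing batches writes and keeps the serving path fast, while direct Thrift inserts are simpler and nearer to realtime but put write load on the request path.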
> Does this basic data strategy sound solid? Any suggestions or potential
> pitfalls? I would love some advice from those more seasoned in handling
> large-volume analytical datasets.
> Thanks guys
> Brian
> -- 
> View this message in context:
> http://old.nabble.com/hbase-architecture-question-tp31374398p31374398.html
> Sent from the HBase User mailing list archive at Nabble.com.

View this message in context: http://old.nabble.com/hbase-architecture-question-tp31374398p31380826.html
Sent from the HBase User mailing list archive at Nabble.com.
