hbase-user mailing list archives

From Dan Harvey <dan.har...@mendeley.com>
Subject Re: help for designing a hbase
Date Tue, 15 Jun 2010 21:21:19 GMT
Hey Johannes,

We're using hbase for something similar at Mendeley. We store all our raw
http logs in hdfs, then use pig scripts to process these into hits per day
for each of our articles. We store the results in an hbase table as follows:

articleId_date => counts:total:30, counts:unique:10, ...

So the key is a combination of the articleId and the date of the hits, and
we use a single column family with as many qualifiers as we wish to store
stats for, like total, unique, etc.
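
As a rough illustration of that layout, here is a minimal Python sketch that
builds the row key and cells for one article on one day. The function name
`make_row`, the `YYYYMMDD` date format and the plain dict standing in for an
hbase client are all assumptions made for the example, not a description of
our actual code.

```python
from datetime import date

def make_row(article_id, day, stats):
    """Return (row_key, cells) for one article on one day.

    Cells are keyed by "family:qualifier"; the single family here is "counts".
    """
    # Assumed format: zero-padded YYYYMMDD, so one article's rows sort by date.
    row_key = f"{article_id}_{day.strftime('%Y%m%d')}"
    cells = {f"counts:{name}": value for name, value in stats.items()}
    return row_key, cells

# Hypothetical article id and counts, matching the example row above.
row_key, cells = make_row("article42", date(2010, 6, 15), {"total": 30, "unique": 10})
print(row_key, cells)
```

Putting the articleId first keeps all the days for one article next to each
other in the table, which makes scanning a date range for an article cheap.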

What we can then do is run map/reduce jobs in hadoop over this table to
gather aggregated statistics, as you would like to do. So to find the total
hits for an article over a week, we map the articleId_date keys to
articleId_week, then sum the hits at the reducer for each of the keys we get.
Using this type of idea you can get a wide range of aggregated data and
statistics for a large number of hits quite easily.
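
The weekly roll-up described above could be sketched roughly like this. It is
plain Python that only imitates the map and reduce steps in memory, not a real
hadoop job; the function names, the ISO week numbering and the sample rows are
assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime

def map_to_week(row_key, total):
    """Map an articleId_date key (date as YYYYMMDD) to an articleId_week key."""
    article_id, day = row_key.rsplit("_", 1)
    iso_year, iso_week, _ = datetime.strptime(day, "%Y%m%d").isocalendar()
    return f"{article_id}_{iso_year}W{iso_week:02d}", total

def reduce_sum(pairs):
    """Sum the values seen for each key, as the reducer would."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical daily totals: two days in one week, one day in the next.
rows = [
    ("article42_20100614", 30),
    ("article42_20100615", 12),
    ("article42_20100621", 5),
]
weekly = reduce_sum(map_to_week(key, total) for key, total in rows)
print(weekly)
```

Swapping the week for a month or year in the mapper gives the other
granularities with the same reducer.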

You could also store the hits directly into hbase if you like using
something like

page_timestamp_uniqueId => hit:source_ip, hit:response_time, hit:browser, ...

The problem with this is that you'll have to be careful to put a uniqueId
based on some random data at the end of the key, in case two users go to the
same page at the same time. So it might be better just to store the
post-processed data in hbase and do the analytics on that. I would
definitely recommend looking at pig if you want to do that.
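
If you do go for one row per hit, the random suffix on the key could look
something like the sketch below. The function name `make_hit_key`, the
millisecond timestamp and the choice of a random UUID are assumptions for
illustration; any sufficiently random suffix would do the same job.

```python
import uuid

def make_hit_key(page, timestamp_ms):
    """Build a page_timestamp_uniqueId row key for a single hit."""
    # uuid4 is random, so two hits on the same page in the same
    # millisecond still end up in different rows.
    unique_id = uuid.uuid4().hex
    # Zero-pad the timestamp so keys for one page sort in time order.
    return f"{page}_{timestamp_ms:013d}_{unique_id}"

# Two hypothetical hits on the same page at the same instant.
k1 = make_hit_key("/articles/42", 1276636879000)
k2 = make_hit_key("/articles/42", 1276636879000)
print(k1)
print(k2)
```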

There are probably many other ways you could use column families and their
qualifiers to store hits, but I think you'll find that one hit per row
scales best for storing large numbers of hits in hbase.

Hope that helps,

On 14 June 2010 19:11, Johannes Weissensel <whitesensless@googlemail.com> wrote:

> Hi everyone,
> i am new to nosql databases and especially column-oriented Databases
> like hbase.
> I am a student on information-systems and i evaluate a fitting no-sql
> database for a web analytics system. Got the use-case of data like
> webserver-logfile.
> in an RDBMS it would be for every hit a row in the database, and than
> endless grouping and counting on the data for getting the metrics you
> want.
> Is there anyone who has experiences with data like that in hypertable,
> how should i design the database?
> Also for every hit a single row, or maybe for every session an
> aggregated version of the data, or for every day and every page a
> single aggregated version.
> Maybe some has an idea, how to design the database? Just like an
> typical not normalized sql database?
> Hope you have some ideas :)
> Johannes

Dan Harvey | Datamining Engineer

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015
