hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Arvind Jayaprakash <w...@anomalizer.net>
Subject Re: data structure
Date Sun, 17 Jul 2011 16:30:00 GMT
On Jul 14, Andre Reiter wrote:
>now we are running mapreduce jobs to generate a report: for example, we
>want to know how many impressions were made by all users in the last x
>days. therefore the scan of the MR job runs over all data in our
>hbase table for the particular family. this currently takes about
>70 seconds, which is a bit too long, and as the data grows the time
>will increase unless we add new workers to the cluster. we have 22
>regions right now

Are you looking for the average number of impressions per user in the last
'x' days, or the total number of impressions across all users in the last
'x' days? I assume it is the latter.

The only reasonable way is to do frequent rollups (think a count for every
minute/hour) and store them for future use. The cost of performing these
rollups will always be a function of your traffic/data volume. However, the
cost of retrieving your answer should be fixed for a given 'x' and rollup
window size, regardless of how much traffic you see. This way, your online
application (I'm guessing from your latency needs) is de-linked from raw
data volumes.
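
As a concrete illustration, here is a minimal sketch of an hourly rollup
using the HBase Java client (0.90-style API). The "rollup" table name, the
"c:impressions" column, and the row-key format are assumptions for the
example, not something from your schema: writers atomically increment the
counter for the current hour, and the report becomes a short scan over at
most 24 * x rollup rows instead of a full scan of the raw table.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ImpressionRollup {

  // Row key for an hourly bucket, zero-padded so the rows sort in time
  // order, e.g. "rollup-0000364144" (hours since the epoch).
  private static byte[] hourKey(long millis) {
    long hour = millis / (3600L * 1000L);
    return Bytes.toBytes(String.format("rollup-%010d", hour));
  }

  // Called once per impression (or per batch): bump the counter for the
  // current hour. HBase increments are atomic, so many writers can share
  // the same bucket.
  public static void recordImpression(HTable table) throws IOException {
    table.incrementColumnValue(hourKey(System.currentTimeMillis()),
        Bytes.toBytes("c"), Bytes.toBytes("impressions"), 1L);
  }

  // "Impressions in the last x days" is now a scan over at most 24 * x
  // rollup rows, independent of how much raw data you have accumulated.
  public static long impressionsLastDays(HTable table, int days) throws IOException {
    long now = System.currentTimeMillis();
    long start = now - days * 24L * 3600L * 1000L;
    Scan scan = new Scan(hourKey(start), hourKey(now + 3600L * 1000L));
    long total = 0;
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        byte[] v = r.getValue(Bytes.toBytes("c"), Bytes.toBytes("impressions"));
        if (v != null) {
          total += Bytes.toLong(v);
        }
      }
    } finally {
      scanner.close();
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable rollups = new HTable(conf, "rollup");
    recordImpression(rollups);
    System.out.println("last 7 days: " + impressionsLastDays(rollups, 7));
    rollups.close();
  }
}

If the single current-hour row ever becomes a write hotspot, you can split
each hour into a handful of salted rows and sum them at read time; the read
cost stays bounded by the rollup window either way.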
