hbase-user mailing list archives

From Andre Reiter <a.rei...@web.de>
Subject data structure
Date Thu, 14 Jul 2011 19:52:05 GMT
Hi everybody,

we have our hadoop + hbase cluster running at the moment with 6 servers

everything is working just fine. We have a web application, where data is stored with the
row key = user id (a meaningless UUID). Our users have a cookie, which is the row key; behind
this key are families with items, e.g. the family "impressions", where every impression is stored
with its time stamp etc.
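
to make the layout concrete, here is a rough in-memory sketch in Python (not HBase client code). the names `table`, `store_impression` and `get_user` are made up for illustration, and using the impression time in milliseconds as the column qualifier is an assumption, not necessarily how our columns are actually laid out.

```python
import uuid
from collections import defaultdict

# In-memory stand-in for the table: row key -> family -> qualifier -> value.
# Assumption: the qualifier is the impression time in milliseconds.
table = defaultdict(lambda: defaultdict(dict))


def store_impression(user_id: str, ts_ms: int, payload: str) -> None:
    """Store one impression under the user's row in the 'impressions' family."""
    table[user_id]["impressions"][ts_ms] = payload


def get_user(user_id: str) -> dict:
    """Single-row lookup by user id; returns {} for an unknown user."""
    # .get() avoids creating an empty row for users we have never seen.
    return {family: dict(cells) for family, cells in table.get(user_id, {}).items()}


if __name__ == "__main__":
    uid = str(uuid.uuid4())  # the cookie value, used directly as the row key
    store_impression(uid, 1310673125000, "banner-1")
    print(get_user(uid))
```

the point is only that everything for one user sits behind one key, which is what makes the single-user lookup cheap.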

the row key is the user id to make real-time requests possible, so we can
retrieve all data for a user very fast

now we are running mapreduce jobs to generate reports: for example, we want to know how many
impressions were made by all users in the last x days. the MR job's scan therefore runs
over all data in our hbase table for that particular family. this takes about 70 seconds
at the moment, which is a bit too long, and as the data grows the time will increase
unless we add new workers to the cluster. right now we have 22 regions
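
in plain Python terms, the report boils down to something like the sketch below. it is a single-process model of the logic, not the actual MR job; `count_recent_impressions` and the nested-dict shape of `rows` (row key -> family -> time stamp in ms -> value) are assumptions made for illustration.

```python
DAY_MS = 24 * 60 * 60 * 1000


def count_recent_impressions(rows: dict, now_ms: int, days: int) -> int:
    """Count impressions from the last `days` days by visiting every row."""
    cutoff = now_ms - days * DAY_MS
    count = 0
    # The row key (a UUID) says nothing about time, so no row can be skipped.
    for families in rows.values():
        for ts_ms in families.get("impressions", {}):
            if cutoff <= ts_ms <= now_ms:
                count += 1
    return count


if __name__ == "__main__":
    now = 1310673125000
    rows = {
        "user-a": {"impressions": {now - DAY_MS: "x", now - 10 * DAY_MS: "y"}},
        "user-b": {"impressions": {now - 2 * DAY_MS: "z"}},
    }
    print(count_recent_impressions(rows, now, 3))
```

the work grows with the total number of rows, not with the number of impressions inside the x-day window.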

the problem i see is that we cannot define a filter for the scan: the row key (user id)
is just a UUID, with nothing meaningful in it

what can we do to improve (accelerate) the scan process anyway? is it maybe advisable to
store the data more redundantly? for example, we create a second table and store every impression
twice: once with the user id as row key in the first table, and once with a
time stamp as row key in the second table.
the data volume would grow twice as fast, but our scans would run x times faster on the second
table compared to now
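
a rough sketch of what the key for the second table could look like, in Python. the layout (8-byte big-endian time stamp followed by the 16 UUID bytes) and the names `impression_row_key` and `scan_bounds` are assumptions for illustration only; appending the user id is just one way to keep two impressions in the same millisecond from overwriting each other.

```python
import struct
import uuid
from typing import Tuple

DAY_MS = 24 * 60 * 60 * 1000


def impression_row_key(ts_ms: int, user_id: uuid.UUID) -> bytes:
    """Time stamp first, so that byte order of the keys is chronological order."""
    # ">Q" = unsigned 64-bit big-endian; the UUID bytes keep the key unique.
    return struct.pack(">Q", ts_ms) + user_id.bytes


def scan_bounds(now_ms: int, days: int) -> Tuple[bytes, bytes]:
    """Start row (inclusive) and stop row (exclusive) covering the last `days` days."""
    start = struct.pack(">Q", now_ms - days * DAY_MS)
    stop = struct.pack(">Q", now_ms + 1)
    return start, stop


if __name__ == "__main__":
    now = 1310673125000
    user = uuid.uuid4()
    start, stop = scan_bounds(now, 3)
    recent = impression_row_key(now - DAY_MS, user)
    old = impression_row_key(now - 10 * DAY_MS, user)
    print(start <= recent < stop, start <= old < stop)
```

with keys like these, a scan for the last x days would only need the rows between the start and stop key instead of the whole table. one thing we would have to check: keys that only ever increase would probably send all new writes to the same region.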

comments are much appreciated

andre

