hbase-user mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: data structure
Date Thu, 14 Jul 2011 21:17:20 GMT
You can play tricks with the arrangement of the key.

For instance, you can put the date at the end of the key.  That would let you
pull data for a particular user for a particular date range.  The date
should not be a time stamp, but a low-res version of time
(day-level resolution might be ok) so that you can minimize the number of rows.
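A minimal sketch of that trick in Java (the class and method names here are hypothetical, not from the thread; only the key layout follows Ted's suggestion): the row key is the user id followed by a big-endian day number, so all of one user's rows sort chronologically and a date range becomes a contiguous key range.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

public class CompositeKey {
    // Row key = userId bytes + big-endian day number (low-res time).
    // Big-endian encoding keeps keys for one user sorted chronologically,
    // since HBase orders rows by lexicographic byte comparison.
    public static byte[] rowKey(String userId, long epochMillis) {
        long day = TimeUnit.MILLISECONDS.toDays(epochMillis);
        byte[] user = userId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(user.length + Long.BYTES)
                .put(user)
                .putLong(day)   // ByteBuffer writes big-endian by default
                .array();
    }
}
```

With keys shaped like this, a Scan whose start row is `rowKey(user, from)` and whose stop row is `rowKey(user, to)` pulls exactly one user's rows for the date range, without touching the rest of the table.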

On Thu, Jul 14, 2011 at 12:52 PM, Andre Reiter <a.reiter@web.de> wrote:

> Hi everybody,
> we have our Hadoop + HBase cluster running at the moment with 6 servers, and
> everything is working just fine. We have a web application where data is
> stored with the row key = user id (a meaningless UUID). So our users have a
> cookie, which is the row key; behind this key are families with items, e.g.
> the family "impressions", where every impression is stored with its time stamp
> etc.
> The row key is the user id to make real-time requests possible, so we can
> retrieve all of a user's data very fast.
> Now we are running MapReduce jobs to generate reports: for example, we
> want to know how many impressions were made by all users in the last x days.
> Therefore the scan of the MR job runs over all data in our HBase table
> for the particular family. This takes about 70 seconds at the moment, which
> is actually a bit too long, and as the data grows, the time will
> increase unless we add new workers to the cluster. We have 22 regions
> right now.
> The problem I see is that we cannot define a filter for the scan; the row
> key (user id) is just a UUID, with nothing meaningful in it.
> What can we do to improve (accelerate) the scan process? Is it
> maybe advisable to store the data redundantly? For example, we could create
> a second table and store every impression twice: once with the user id as the
> row key in the first table, and once with a time stamp as the row
> key in the second table.
> The data volume would grow twice as fast, but our scans on the second table
> would be x times faster compared to now.
> Comments are very much appreciated.
> andre
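The second-table idea from the quoted message can be sketched in the same style (again, a hypothetical helper, not code from the thread): the time-keyed copy puts the timestamp first in the row key, so a "last x days" report becomes a range scan instead of a full-table scan.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class TimeIndexKey {
    // Hypothetical row key for the proposed second table:
    // big-endian timestamp first, then the user id, so rows sort by time
    // and a date range maps to a contiguous row-key range for the MR scan.
    public static byte[] rowKey(long epochMillis, String userId) {
        byte[] user = userId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Long.BYTES + user.length)
                .putLong(epochMillis)  // big-endian by default
                .put(user)
                .array();
    }
}
```

Each impression would then be written twice: once to the user-keyed table for the real-time lookups, and once to this time-keyed table for the reporting scans, at the cost of doubling write volume, just as the message describes.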
