hbase-user mailing list archives

From Felix Sprick <fspr...@gmail.com>
Subject Hbase row key & MapReduce
Date Tue, 01 Mar 2011 16:40:20 GMT
Hi everyone,

I have a question regarding the design of the row key for an HBase table. I
am working with a system that stores hundreds of values up to 50 times per
second over a period of several months. I want to run MapReduce jobs on this
data, performing simple calculations for each row within a certain period of
time (usually hours, but potentially also days or weeks). I chose MapReduce
because it would let us run these simple calculations in parallel across the
cluster.

How do I get the data distributed over the HBase cluster so that
the MapReduce calculation involves as many nodes as possible? If I use the
timestamp as the row key, I would end up with all data on one or a few
machines and run into hotspotting issues, and the MapReduce job would only
run on a subset of the machines in the cluster. If I invert the timestamp and
use this as the row key, the data is distributed more evenly and
MapReduce jobs could run on several machines. The problem then is that I
wouldn't be able to restrict the input of the MapReduce job with
startRow/stopRow bounds on the scan, because rows belonging to one time frame
would no longer be stored sequentially. Or is MapReduce designed in a way
that I always have to walk through the entire database row by row?
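
To make the trade-off concrete, here is a minimal sketch of one possible middle ground, a "salted" key: a small bucket prefix in front of the timestamp. This is an assumption for illustration, not something from an existing setup. The names NUM_BUCKETS, salted_key and scan_ranges are hypothetical, the bucket count of 16 is arbitrary, and the code only computes key bytes without calling HBase. Writes spread over the buckets, while a time window still maps to one contiguous startRow/stopRow range per bucket.

```python
import struct
import zlib

NUM_BUCKETS = 16  # hypothetical bucket count, chosen only for illustration


def salted_key(ts_millis: int, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Build a row key of the form <1-byte bucket><8-byte big-endian timestamp>."""
    # Hash the timestamp so consecutive samples land in different buckets.
    bucket = zlib.crc32(struct.pack(">Q", ts_millis)) % num_buckets
    return struct.pack(">BQ", bucket, ts_millis)


def scan_ranges(start_millis: int, stop_millis: int,
                num_buckets: int = NUM_BUCKETS) -> list:
    """Return one (start_row, stop_row) pair per bucket for [start, stop)."""
    # Big-endian unsigned integers sort the same way as their byte strings,
    # so each bucket holds its rows in timestamp order.
    return [
        (struct.pack(">BQ", b, start_millis), struct.pack(">BQ", b, stop_millis))
        for b in range(num_buckets)
    ]


if __name__ == "__main__":
    for start_row, stop_row in scan_ranges(1000, 2000):
        print(start_row.hex(), stop_row.hex())
```

Each pair would serve as the startRow (inclusive) and stopRow (exclusive) of one scan, so reading a time window costs NUM_BUCKETS range scans instead of one full-table pass. The bucket count trades write distribution against the number of scans per query.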

Any ideas?

