hadoop-mapreduce-user mailing list archives

From Andreas Reiter <a.rei...@web.de>
Subject High Throughput using row keys based on the current time
Date Fri, 28 Oct 2011 14:08:17 GMT
Hi everybody,

we have the following scenario:
our clustered web application needs to write records to HBase, and we need to support a very high
throughput: we expect 10-30 thousand requests per second, and maybe even more

usually this is not a problem for HBase if we use a "random" row key; in that case the data
is distributed evenly across all region servers
but we need to generate our keys based on the current time, so that we can run MR jobs
over a given period of time without processing the whole data set, using
   scan.setStartRow(startRow);
   scan.setStopRow(stopRow);
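
To make the range idea concrete, here is a minimal sketch in plain Java with no HBase dependency. The key layout (an 8-byte big-endian timestamp in milliseconds) and the names rowKey, compare and inRange are assumptions made up for illustration. It relies on HBase ordering row keys as unsigned bytes, and on a scan treating the start row as inclusive and the stop row as exclusive:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class TimeKeyRange {
    // Hypothetical key layout: 8-byte big-endian epoch millis, so byte-wise
    // order matches chronological order for non-negative timestamps.
    static byte[] rowKey(long epochMillis) {
        return ByteBuffer.allocate(Long.BYTES).putLong(epochMillis).array();
    }

    // Unsigned lexicographic comparison of two keys.
    static int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    // True if key falls in [startRow, stopRow): start inclusive, stop exclusive.
    static boolean inRange(byte[] key, byte[] startRow, byte[] stopRow) {
        return compare(key, startRow) >= 0 && compare(key, stopRow) < 0;
    }

    public static void main(String[] args) {
        long periodStart = 1319810400000L;         // arbitrary example timestamp
        long periodEnd = periodStart + 3_600_000L; // one hour later

        byte[] startRow = rowKey(periodStart);
        byte[] stopRow = rowKey(periodEnd);

        // A record written 10 minutes into the period is selected.
        System.out.println(inRange(rowKey(periodStart + 600_000L), startRow, stopRow));
        // A record written exactly at the end of the period is not.
        System.out.println(inRange(rowKey(periodEnd), startRow, stopRow));
    }
}
```

With real keys, startRow and stopRow computed this way would be the values passed to the two scan calls above.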

in our case the generated row keys look similar and are therefore all going to the same region
server... so this approach does not use the power of the whole cluster, but only one
server, which can be dangerous under very high load
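
The effect can be shown with a small simulation. This is a sketch in plain Java, not HBase code: the region split points, the key formats (13-digit millisecond strings and 8-digit hex strings) and all method names are invented for illustration. Each region is modelled only by its start key, and a key belongs to the region with the greatest start key not larger than it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class RegionHotspotSketch {
    // Builds a region table: region 0 starts at the empty key,
    // region i+1 starts at splitPoints[i] (split points must be sorted).
    static NavigableMap<String, Integer> regions(String... splitPoints) {
        NavigableMap<String, Integer> byStartKey = new TreeMap<>();
        byStartKey.put("", 0);
        for (int i = 0; i < splitPoints.length; i++) {
            byStartKey.put(splitPoints[i], i + 1);
        }
        return byStartKey;
    }

    // Returns the set of region indexes that receive at least one of the keys.
    static Set<Integer> regionsHit(NavigableMap<String, Integer> regions, List<String> keys) {
        Set<Integer> hit = new TreeSet<>();
        for (String key : keys) {
            hit.add(regions.floorEntry(key).getValue());
        }
        return hit;
    }

    // Hypothetical time-based keys: zero-padded consecutive millisecond timestamps.
    static List<String> timeKeys(long startMillis, int count) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            keys.add(String.format("%013d", startMillis + i));
        }
        return keys;
    }

    // Hypothetical "random" keys: 8 hex digits from a seeded generator.
    static List<String> randomKeys(long seed, int count) {
        Random rnd = new Random(seed);
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            keys.add(String.format("%08x", rnd.nextInt()));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Time-keyed table: splits happened at earlier timestamps.
        NavigableMap<String, Integer> timeTable =
                regions("1319000000000", "1319400000000", "1319800000000");
        // Randomly keyed table: splits divide the hex key space evenly.
        NavigableMap<String, Integer> randomTable =
                regions("40000000", "80000000", "c0000000");

        System.out.println("time keys hit regions:   "
                + regionsHit(timeTable, timeKeys(1319810897000L, 1000)));
        System.out.println("random keys hit regions: "
                + regionsHit(randomTable, randomKeys(42L, 1000)));
    }
}
```

Because every new time-based key is larger than all existing split points, all current writes fall into the last region, whereas the random keys spread over several regions.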

so we are thinking about writing the records to an HDFS file first, and additionally running an
MR job periodically to read the finished HDFS files and insert the records into HBase

what do you guys think about it? any suggestions would be much appreciated

regards
andre
