hadoop-mapreduce-user mailing list archives

From Joey Echeverria <j...@cloudera.com>
Subject Re: High Throughput using row keys based on the current time
Date Fri, 28 Oct 2011 14:34:52 GMT
Have you looked into bulk imports? You can write your data into HDFS
and then run a MapReduce job to generate the files that HBase uses to
serve data. After the job finishes, there's a utility to copy the
files into HBase's directory and your data is visible. Check out
http://hbase.apache.org/bulk-loads.html for details.
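
On the time-ranged scans in the quoted message below: a minimal sketch of why
time-based keys land next to each other and how the scan bounds line up. The
key layout (8-byte big-endian epoch milliseconds) and the names TimeRowKeys,
rowKey and compareRows are assumptions made for illustration, not something
taken from your setup. The sketch uses only the JDK; the resulting byte arrays
are what would be handed to scan.setStartRow() and scan.setStopRow(), where
the start row is inclusive and the stop row is exclusive.

```java
import java.nio.ByteBuffer;

public class TimeRowKeys {

    // Hypothetical key layout: 8-byte big-endian epoch milliseconds.
    // For non-negative timestamps, byte order then matches time order.
    public static byte[] rowKey(long epochMillis) {
        return ByteBuffer.allocate(8).putLong(epochMillis).array();
    }

    // Unsigned lexicographic comparison, the ordering HBase applies to row keys.
    public static int compareRows(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        long from = 1319760000000L;   // an arbitrary instant, epoch millis
        long to = from + 3600000L;    // one hour later

        byte[] startRow = rowKey(from); // would go to scan.setStartRow(startRow)
        byte[] stopRow = rowKey(to);    // would go to scan.setStopRow(stopRow)

        // The start row must sort before the stop row, otherwise the range is empty.
        System.out.println("start sorts before stop: "
                + (compareRows(startRow, stopRow) < 0));

        // Keys generated moments apart differ only in their trailing bytes, so
        // they sort next to each other and fall into the same region.
        byte[] k1 = rowKey(from + 1);
        byte[] k2 = rowKey(from + 2);
        System.out.println("adjacent keys stay inside the range: "
                + (compareRows(startRow, k1) <= 0 && compareRows(k2, stopRow) < 0));
    }
}
```

The same ordering that makes the range scan cheap is what sends all current
writes to one region, which is the trade-off the bulk import route avoids on
the write path.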

-Joey

On Fri, Oct 28, 2011 at 10:08 AM, Andreas Reiter <a.reiter@web.de> wrote:
> Hi everybody,
>
> we have the following scenario:
> our clustered web application needs to write records to HBase, and we need to
> support a very high throughput: we expect up to 10-30 thousand requests per
> second, and maybe even more
>
> so usually it is not a problem for HBase, if we use a "random" row key; in
> this case the data is distributed between all region servers equally
> but, we need to generate our keys based on the current time, so we are able
> to run MR jobs for a period of time without processing the whole data, using
>  scan.setStartRow(startRow);
>  scan.setStopRow(stopRow);
>
> in our case the generated row keys look similar and are therefore going to
> the same region server... so this approach is not really using the power of
> the whole cluster, but only one server, which can be dangerous in case of a
> very high load
>
> so, we are thinking about writing the records first to an HDFS file, and
> additionally running an MR job periodically to read the finished HDFS files
> and insert the records into HBase
>
> what do you guys think about it? any suggestions would be much appreciated
>
> regards
> andre
>



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434
