hbase-user mailing list archives

From Tim Robertson <timrobertson...@gmail.com>
Subject Re: Shared Cluster between HBase and MapReduce
Date Wed, 06 Jun 2012 07:22:25 GMT
Like Amandeep says, it really depends on the access patterns and jobs
running on the cluster.

We are using a single cluster for HBase and MR, with each node running a
DataNode (DN), TaskTracker (TT) and RegionServer (RS).
We have tried mixed clusters with only some nodes running an RS, but you
start to suffer from data locality issues during scans.  Our primary access
patterns are a checkAndPut on HBase and a full scan over HBase; a sketch of
both is below.
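
To make those two patterns concrete, here is a minimal sketch using the
HBase client API (the table, family and qualifier names are made up for
illustration, not our actual schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class AccessPatterns {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");  // hypothetical table name

    // Pattern 1: checkAndPut - an atomic write that only succeeds if the
    // current cell value matches the expected one (null = cell is absent).
    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v1"));
    boolean applied = table.checkAndPut(Bytes.toBytes("row-1"),
        Bytes.toBytes("cf"), Bytes.toBytes("q"), null, put);
    System.out.println("checkAndPut applied: " + applied);

    // Pattern 2: a full table scan. A large caching value cuts RPC round
    // trips; disabling block caching stops the scan evicting hot data.
    Scan scan = new Scan();
    scan.setCaching(1000);
    scan.setCacheBlocks(false);
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      // process each row here
    }
    scanner.close();
    table.close();
  }
}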

To give you an idea of the impact of data locality on scan performance in
HBase, see the blog [1] I wrote on how we monitored scan performance.
Scanning our HBase tables is an order of magnitude (roughly 10x) slower
when you don't have data locality, and we clearly hit network limits (the
traffic between scan clients running in mappers and RSs on other machines);
the sketch below shows the shape of the full-scan job where this bites.  We
have not taken this work to production yet, so it is possible we will see
issues when (e.g.) regions start to split, but we'll blog about it if that
comes up.  Whatever you do, we have found Ganglia absolutely critical for
understanding what is happening on the cluster, and we use Puppet [2] so we
can quickly test different setups.
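
If it helps, this is the shape of such a full-scan job (a sketch only;
the class name, table name and tuning values are illustrative, not our
production code).  TableMapReduceUtil gives you one map task per region,
and the framework tries to schedule each task on the node hosting that
region, which is why running DN, TT and RS on every node keeps the reads
local:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class FullScanJob {

  // Receives one region's rows per map task; with co-located RS and TT
  // the rows are read from the local region server.
  static class ScanMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result values,
        Context context) {
      // process each row here
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "full-scan");
    job.setJarByClass(FullScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(1000);       // bigger batches per RPC
    scan.setCacheBlocks(false);  // don't evict the RS block cache

    TableMapReduceUtil.initTableMapperJob("mytable", scan,
        ScanMapper.class, NullWritable.class, NullWritable.class, job);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}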

Cheers,
Tim

[1]
http://gbif.blogspot.dk/2012/05/optimizing-hbase-mapreduce-scans-for.html
[2] e.g. https://github.com/lfrancke/puppet-cdh, though other modules are
needed too



On Wed, Jun 6, 2012 at 2:20 AM, Amandeep Khurana <amansk@gmail.com> wrote:

> Atif,
>
> These are general recommendations and definitely change based on the
> access patterns and the way you will be using HBase and MapReduce. In
> general, if you are building a latency sensitive application on top of
> HBase, running a MapReduce job at the same time will impact performance due
> to I/O contention. If your main access pattern is going to be running
> MapReduce over HBase tables, you should absolutely consider collocating the
> two frameworks. Now, these recommendations might change based on the
> resources you have on your nodes (CPU, disk, memory).
>
> Having a single HDFS cluster and using some hosts for HBase and others for
> MapReduce only gets you a common storage fabric. It doesn't solve the
> problem of reading data into MapReduce tasks from remote hosts (region
> servers in this case) and is pretty much the same as having two separate
> clusters. With two separate clusters, you'll simply run your MapReduce
> jobs against a remote HBase instance; you don't have to manually export
> data out of that cluster onto the MapReduce cluster to run jobs on it.
>
> Hope that makes it clearer.
>
> -Amandeep
>
>
> On Tuesday, June 5, 2012 at 5:00 PM, Atif Khan wrote:
>
> >
> > During a recent Cloudera course we were told that it is "best
> > practice" to isolate a MapReduce/HDFS cluster from an HBase/HDFS
> > cluster, as the two, when sharing the same HDFS cluster, could lead to
> > performance problems. I am not sure if this is entirely true given the
> > fact that the main concept behind Hadoop is to export computation to
> > the data and not import data to the computation. If I were to
> > segregate the HBase and MapReduce clusters, then when using MapReduce
> > on HBase data would I not have to transfer large amounts of data from
> > the HBase/HDFS cluster to the MapReduce/HDFS cluster?
> >
> > Cloudera on their best practices page
> > (http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/) has the
> > following:
> > "Be careful when running mixed workloads on an HBase cluster. When you
> > have SLAs on HBase access independent of any MapReduce jobs (for
> > example, a transformation in Pig and serving data from HBase) run them
> > on separate clusters. HBase is CPU and Memory intensive with sporadic
> > large sequential I/O access while MapReduce jobs are primarily I/O
> > bound with fixed memory and sporadic CPU. Combined these can lead to
> > unpredictable latencies for HBase and CPU contention between the two.
> > A shared cluster also requires fewer task slots per node to
> > accommodate for HBase CPU requirements (generally half the slots on
> > each node that you would allocate without HBase). Also keep an eye on
> > memory swap. If HBase starts to swap there is a good chance it will
> > miss a heartbeat and get dropped from the cluster. On a busy cluster
> > this may overload another region server, causing it to swap and a
> > cascade of failures."
> >
> > All my initial investigation/reading led me to believe that I should
> > create a common HDFS cluster and then run MapReduce and HBase against
> > it. But from the above Cloudera best practice it seems like I should
> > create two HDFS clusters, one for MapReduce and one for HBase, and
> > then move data around when required. Something does not make sense
> > with this best practice recommendation.
> >
> > Any thoughts and/or feedback will be much appreciated.
> >
> > --
> > View this message in context:
> > http://old.nabble.com/Shared-Cluster-between-HBase-and-MapReduce-tp33967219p33967219.html
> > Sent from the HBase User mailing list archive at Nabble.com
> > (http://Nabble.com).
> >
> >
>
>
>
