hbase-user mailing list archives

From Atif Khan <atif_ijaz_k...@hotmail.com>
Subject Shared Cluster between HBase and MapReduce
Date Wed, 06 Jun 2012 00:00:28 GMT

During a recent Cloudera course we were told that it is "best practice" to
isolate a MapReduce/HDFS cluster from an HBase/HDFS cluster, because the two
can cause performance problems when they share the same HDFS cluster.  I am
not sure this is entirely true, given that the main concept behind Hadoop is
to move the computation to the data rather than the data to the computation.
If I were to segregate the HBase and MapReduce clusters, then when running
MapReduce on HBase data, would I not have to transfer large amounts of data
from the HBase/HDFS cluster to the MapReduce/HDFS cluster?

Cloudera, on their best practice page
(http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/), has the
following:
"Be careful when running mixed workloads on an HBase cluster. When you have
SLAs on HBase access independent of any MapReduce jobs (for example, a
transformation in Pig and serving data from HBase) run them on separate
clusters. HBase is CPU and Memory intensive with sporadic large sequential
I/O access while MapReduce jobs are primarily I/O bound with fixed memory
and sporadic CPU. Combined these can lead to unpredictable latencies for
HBase and CPU contention between the two. A shared cluster also requires
fewer task slots per node to accommodate for HBase CPU requirements
(generally half the slots on each node that you would allocate without
HBase). Also keep an eye on memory swap. If HBase starts to swap there is a
good chance it will miss a heartbeat and get dropped from the cluster. On a
busy cluster this may overload another region, causing it to swap and a
cascade of failures."
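
To make the slot guidance in that quote concrete, here is a rough sketch of
how I read it. The function name and the example slot counts are my own
assumptions for illustration, not anything from Cloudera or from a Hadoop
API; the only thing taken from the quote is the "generally half the slots"
rule of thumb.

```python
def shared_cluster_slots(standalone_slots: int) -> int:
    """Return the task slots per node to configure when HBase shares the node.

    Applies the "generally half the slots" rule of thumb from the quoted
    guidance. Rounding down is an assumption; the quote does not say how
    to treat odd slot counts.
    """
    if standalone_slots < 0:
        raise ValueError("slot count cannot be negative")
    return standalone_slots // 2


if __name__ == "__main__":
    # Hypothetical per-node slot counts for a MapReduce-only cluster.
    for slots in (8, 12, 16):
        print(f"{slots} slots without HBase -> "
              f"{shared_cluster_slots(slots)} slots with HBase")
```

So a node that I would otherwise give 8 task slots would get about 4 on a
shared cluster, which is the capacity cost I am weighing against moving data
between two clusters.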

All my initial investigation and reading led me to believe that I should
create a common HDFS cluster and then run both MapReduce and HBase against
it.  But from the Cloudera best practice above, it seems that I should
create two HDFS clusters, one for MapReduce and one for HBase, and then move
data between them when required.  Something does not make sense to me about
this recommendation.

Any thoughts and/or feedback will be much appreciated.

