hbase-user mailing list archives

From Atif Khan <atif_ijaz_k...@hotmail.com>
Subject Re: Shared Cluster between HBase and MapReduce
Date Thu, 07 Jun 2012 05:38:14 GMT

Thanks for the confirmation.  There is also a good/detailed discussion thread
on this issue found at 
http://apache-hbase.679495.n3.nabble.com/Shared-HDFS-for-HBase-and-MapReduce-td4018856.html
.


Michael Segel wrote:
> 
> It depends... There are some reasons to do this; however, in general you
> don't need to.
> 
> The course is wrong to suggest this as a best practice.
> 
> Sent from my iPhone
> 
> On Jun 5, 2012, at 5:00 PM, "Atif Khan" <atif_ijaz_khan@hotmail.com> wrote:
> 
>> 
>> During a recent Cloudera course we were told that it is "Best practice" to
>> isolate a MapReduce/HDFS cluster from an HBase/HDFS cluster, as the two,
>> when sharing the same HDFS cluster, could lead to performance problems.
>> I am not sure this is entirely true, given that the main concept behind
>> Hadoop is to export computation to the data and not import data to the
>> computation.  If I were to segregate HBase and MapReduce clusters, then
>> when using MapReduce on HBase data would I not have to transfer large
>> amounts of data from the HBase/HDFS cluster to the MapReduce/HDFS cluster?
>> 
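For what it's worth, running MapReduce directly against an HBase table on a shared cluster looks roughly like the sketch below: TableInputFormat produces one input split per region, so map tasks are scheduled on (or near) the RegionServers that already hold the data and nothing has to be copied between clusters first. This is only an illustration; the table name "mytable", column family "cf", qualifier "qual" and the Scan settings are made-up placeholders, and the exact Job setup varies with the Hadoop/HBase versions in use.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HBaseScanJobSketch {

  // TableMapper is invoked once per HBase row; the framework places each map
  // task on (or near) the RegionServer hosting the region being scanned.
  static class ValueCountMapper extends TableMapper<Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
        throws IOException, InterruptedException {
      byte[] value = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("qual"));
      if (value != null) {
        context.write(new Text(Bytes.toString(value)), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hbase-scan-sketch");
    job.setJarByClass(HBaseScanJobSketch.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in larger batches for a full-table scan
    scan.setCacheBlocks(false);  // avoid churning the RegionServer block cache from MR

    // Wires TableInputFormat into the job: one input split (and map task) per region.
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, ValueCountMapper.class, Text.class, LongWritable.class, job);

    job.setNumReduceTasks(0);                          // map-only sketch
    job.setOutputFormatClass(NullOutputFormat.class);  // discard output; the wiring is the point

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

On split clusters, the same scan would instead have to pull every row across the network from the HBase cluster into the MapReduce cluster (or be preceded by an export/copy step), which is exactly the data movement asked about above.
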
>> Cloudera on their best practice page
>> (http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/) has the
>> following:
>>
>> "Be careful when running mixed workloads on an HBase cluster. When you have
>> SLAs on HBase access independent of any MapReduce jobs (for example, a
>> transformation in Pig and serving data from HBase) run them on separate
>> clusters. HBase is CPU and Memory intensive with sporadic large sequential
>> I/O access while MapReduce jobs are primarily I/O bound with fixed memory
>> and sporadic CPU. Combined these can lead to unpredictable latencies for
>> HBase and CPU contention between the two. A shared cluster also requires
>> fewer task slots per node to accommodate for HBase CPU requirements
>> (generally half the slots on each node that you would allocate without
>> HBase). Also keep an eye on memory swap. If HBase starts to swap there is a
>> good chance it will miss a heartbeat and get dropped from the cluster. On a
>> busy cluster this may overload another region, causing it to swap and a
>> cascade of failures."
>> 
>> All my initial investigation/reading led me to believe that I should create
>> a common HDFS cluster and then run MapReduce and HBase against that common
>> HDFS cluster.  But from the above Cloudera best practice it seems like I
>> should create two HDFS clusters, one for MapReduce and one for HBase, and
>> then move data around when required.  Something does not make sense with
>> this best practice recommendation.
>> 
>> Any thoughts and/or feedback will be much appreciated.
>> 
>> 
> 
> 

-- 
View this message in context: http://old.nabble.com/Shared-Cluster-between-HBase-and-MapReduce-tp33967219p33973918.html
Sent from the HBase User mailing list archive at Nabble.com.

