Date: Fri, 24 Aug 2012 12:26:48 -0700 (PDT)
From: Marc Sturlese <marc.sturlese@gmail.com>
To: hbase-user@hadoop.apache.org
Subject: RS, TT, shared DN and good performance on random HBase reads


Hey there,
I am wondering whether this is good practice:
I have a 10-node cluster running datanodes and tasktrackers, with MR jobs
running continuously.
My replication factor is 3.
I need to put the results of a couple of jobs into HBase tables so that I can
do random-seek lookups. The HBase tables would be almost read-only, with just
a few additions. They would act almost as a view and would be rebuilt every
5 hours. I want to minimize the impact of the MR jobs running on the cluster
on the random HBase reads. My idea is:
-Keep the 10 nodes with datanodes and tasktrackers
-Add 2 nodes with a datanode and a RS (the data to save into HBase is small
compared to all the data of the cluster)
-Run a bulk import that creates HFiles (for a pre-split table) and then
manually run a compaction (automatic compaction would be disabled)

The reasons for that would be:
-After running a full compaction, the HFiles end up on the RS nodes, so I
would achieve data locality.
-As I have replication factor 3 and just 2 HBase nodes, I know that no map
task would try to read from the RS nodes. The reduce tasks write first to
the node where they run (which will never be a RS node).
-So, on the RS nodes I would end up with the HBase tables plus block replicas
of the MR jobs that will never be read (as maps are scheduled for data
locality and at least one replica of each block will be on a MR node)
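
The counting argument in the last two points can be sanity-checked with a
small sketch. It assumes only that HDFS puts the replicas of one block on
distinct datanodes; the node names are hypothetical, and it says nothing
about which replica a map task actually reads.

```python
from itertools import combinations

# Hypothetical node names: 10 MR nodes (datanode + tasktracker)
# and 2 HBase nodes (datanode + RS).
MR_NODES = [f"mr{i}" for i in range(10)]
RS_NODES = [f"rs{i}" for i in range(2)]
REPLICATION = 3


def placements_without_mr_replica(mr_nodes, rs_nodes, replication):
    """Return every replica placement that has no replica on an MR node.

    Assumption: the replicas of one block land on distinct datanodes.
    """
    mr = set(mr_nodes)
    all_nodes = list(mr_nodes) + list(rs_nodes)
    return [p for p in combinations(all_nodes, replication)
            if not mr.intersection(p)]


bad = placements_without_mr_replica(MR_NODES, RS_NODES, REPLICATION)
print(len(bad))  # 0: 3 distinct replicas cannot all fit on 2 RS nodes
```

With 2 RS nodes and replication factor 3, every possible placement keeps at
least one replica on a MR node, whatever placement policy is in use.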

If this works and I add more nodes with RS and datanode, could I guarantee
that no map task would ever read from them? (This assumes that a reduce task
always writes first to the node where it runs; please correct me if I'm
wrong, as I'm not sure about this.)
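
For the case with more RS nodes, the counting argument changes once the
number of RS nodes reaches the replication factor. A minimal sketch, with
hypothetical node names, that counts the placements in which every replica
of a block sits on a RS node, with and without the assumption that the first
replica goes to the MR node running the reduce task. It models replica
placement only, not which replica a map task actually reads.

```python
from itertools import combinations


def count_all_rs(num_mr, num_rs, replication, first_on_writer):
    """Count block placements whose replicas all sit on RS nodes.

    first_on_writer=True models the assumption stated above: the first
    replica always lands on the MR node where the reduce task runs.
    """
    mr = [f"mr{i}" for i in range(num_mr)]  # hypothetical node names
    rs = [f"rs{i}" for i in range(num_rs)]
    rs_set = set(rs)
    nodes = mr + rs
    count = 0
    if first_on_writer:
        # Pin one replica to each possible writer, then place the rest.
        for writer in mr:
            others = [n for n in nodes if n != writer]
            for rest in combinations(others, replication - 1):
                if rs_set.issuperset((writer,) + rest):
                    count += 1
    else:
        # No constraint beyond distinct datanodes per block.
        for placement in combinations(nodes, replication):
            if rs_set.issuperset(placement):
                count += 1
    return count


for num_rs in (2, 3, 4):
    print(num_rs,
          count_all_rs(10, num_rs, 3, first_on_writer=False),
          count_all_rs(10, num_rs, 3, first_on_writer=True))
# prints:
# 2 0 0
# 3 1 0
# 4 4 0
```

In this model, with 3 or more RS nodes the guarantee no longer follows from
the node count alone; it holds only as long as the first-replica assumption
does.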

I've probably made some wrong assumptions here. Would this be a good way to
achieve my goal? If not, any advice (other than splitting into 2 different
clusters)?