From: Marc Sturlese
To: hbase-user@hadoop.apache.org
Date: Fri, 24 Aug 2012 12:26:48 -0700 (PDT)
Subject: RS, TT, shared DN and good performance on random Hbase random reads.

Hey there,

I am wondering whether this is good practice. I have a 10-node cluster running datanodes and tasktrackers, with MR jobs running continuously. My replication factor is 3. I need to put the results of a couple of jobs into HBase tables so that I can do random-seek lookups. The HBase tables would be almost read-only, with just a few additions. They would act almost as a view and would be rebuilt every 5 hours. I want to minimize the impact of the MR jobs running on the cluster on the random HBase reads.

My idea is:
- Keep the 10 nodes with datanodes and tasktrackers.
- Add 2 nodes with a datanode and an RS (the data to save into HBase is small compared to all the data in the cluster).
- Run a bulk import that creates HFiles (for a pre-split table) and then manually run compaction (automatic compaction would be disabled).

The reasons for that would be:
- After running a full compaction, the HFiles end up on the RS nodes, so I would achieve data locality.
- As I have replication factor 3 and just 2 HBase nodes, I know that no map task would try to read from the RS nodes. The reduce tasks write first to the node where they run (which will never be an RS node).
- So on the RS nodes I would end up with the HBase tables plus block replicas from the MR jobs that will never be read (as maps exploit data locality and at least one replica of each block will be on an MR node).

If this works and I add more nodes with an RS and a datanode, could I guarantee that no map task would ever read from them? (This assumes that a reduce task always writes first to the node where it runs; please correct me if I'm wrong, as I'm not sure about this.)

I have probably made some wrong assumptions here. Would this be a good way to achieve my goal?
If not, any advice (other than splitting into 2 separate clusters)?
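
The replication argument in the message (replication factor 3, only 2 RS nodes) is a counting argument, and it can be checked with a few lines of plain Python. This is only a sketch, not HDFS code: it assumes the replicas of a block land on distinct datanodes and ignores HDFS's real placement policy (rack awareness, writer-local first replica) and the scheduler's actual behaviour. The function name and node labels are made up for illustration.

```python
from itertools import combinations

def all_replicas_can_avoid_mr(num_mr_nodes, num_rs_nodes, replication=3):
    """Return True if some placement puts every replica of a block on RS-only nodes.

    Hypothetical helper: enumerates every set of `replication` distinct
    datanodes and checks whether any set contains no MR (tasktracker) node.
    """
    nodes = [("mr", i) for i in range(num_mr_nodes)] + \
            [("rs", i) for i in range(num_rs_nodes)]
    for placement in combinations(nodes, replication):
        if all(kind == "rs" for kind, _ in placement):
            return True
    return False

# 10 MR nodes + 2 RS nodes: three distinct replicas cannot all fit on 2 nodes.
print(all_replicas_can_avoid_mr(10, 2))  # False
# 10 MR nodes + 3 RS nodes: by counting alone, an RS-only placement exists.
print(all_replicas_can_avoid_mr(10, 3))  # True
```

With 2 RS nodes, every block has at least one replica on an MR node by counting alone. With 3 or more RS nodes, counting no longer gives that guarantee; it would then rest entirely on the assumption that the writing task places its first replica on its own (MR) node.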