Subject: Re: Adding a tiny HBase cluster to existing Hadoop environment
From: Tatsuya Kawano
To: user@hbase.apache.org
Date: Sat, 5 Jun 2010 07:23:54 +0900

Hi Todd,

Thanks for answering my question.

> On Thu, Jun 3, 2010 at 5:06 PM, Tatsuya Kawano wrote:
>> I remember Jon was talking the other day about trying a single HBase
>> server with an existing HDFS cluster to serve map reduce (MR) results. I
>> wonder if this went well or not.
>> So I'm thinking to recommend them to add just one server (non-HA) or two
>> servers (HA) to their Hadoop cluster, and run only HMaster and Region
>> Server processes on the server(s). The HBase cluster will utilize the
>> existing (small or large) HDFS cluster and ZooKeeper ensemble.

I went back to the mailing list archive and found that the information I needed was already there; Jon wrote down the pros and cons of a similar configuration.

RE: HBase on 1 box? how big?
http://markmail.org/thread/3yfoou4gna2fex5f#query:+page:1+mid:4m27ay3mwuh2a5vu+state:results

On 06/04/2010, at 9:37 AM, Todd Lipcon wrote:
> If your "exported dataset" from the MR job is small enough to fit on one
> server, you can certainly use a single HBase RS plus the bulk load
> functionality. However, with a small dataset like that it might make more
> sense to simply export TSV/CSV and then use a tool like Sqoop to export
> to a relational database. That way you'd have better off the shelf
> integration with various other tools or access methods.

Thanks for the suggestion. In this particular configuration, I'm expecting one RS to handle a far larger dataset than in a typical HBase configuration. The dataset is read-only, so all memstores will be empty. This leaves more room in RAM, and the RS could take on more regions than usual. Also, the RS is backed by the current HDFS installation. The larger cluster has more than 50 Data Nodes, which could give the RS better concurrent random read capacity than a single-node RDB with local hard drives.

I talked to the guys last night, and one of them is also evaluating RDBs (Sybase, Oracle and MySQL). His current concern is that loading the large dataset into an RDB is time consuming. He's going to try the native import utilities for the RDBs, and Sqoop is on his list too. (He attended Cloudera Hadoop training in Tokyo.) But he also wants to try HBase as another option because it has better MR integration.

>> Also, I saw Jon's slides for Hadoop World in NYC 2009, and it was said
>> that I'd better have at least 5 Region Servers / Data Nodes in my
>> cluster to get the typical performance. If I deploy RS and DN on
>> separate servers, which one should be >= 5 nodes? DN? RS? or both?
>
> Better to colocate the DNs and RSs for most deployments. You get
> significantly better random read performance for uncached data.
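To put rough numbers on the empty-memstore point above: this is only a back-of-envelope sketch. The 24 GB heap is the upper end of the spec mentioned in this thread, and the 0.40 memstore / 0.20 block cache fractions are the common defaults, not measured values from any of these clusters.

```python
# Back-of-envelope: RAM headroom on a read-only region server.
# Assumptions: 24 GB heap; default-ish heap fractions of 0.40 for
# memstores (writes) and 0.20 for the block cache (reads).

heap_gb = 24.0
memstore_fraction = 0.40    # heap normally reserved for memstores
blockcache_fraction = 0.20  # heap normally used as block cache

# Normal read/write RS: only the block cache share serves random reads.
normal_cache_gb = heap_gb * blockcache_fraction

# Read-only table: memstores stay empty, so that share could be retuned
# into extra block cache, roughly tripling the cached data per server.
readonly_cache_gb = heap_gb * (blockcache_fraction + memstore_fraction)

print(normal_cache_gb, readonly_cache_gb)
```

So even before counting the 50+ data nodes behind it, a read-only RS tuned this way could keep roughly three times as much data hot in cache as a default read/write one.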
If I could build the cluster from scratch, I would do so. The difficult part in my case is that the current installations (50+ servers) were not intended to host RSs. I would need to add more processor cores and RAM to the current servers to make reliable Task Tracker + DN + RS nodes. Also, it's obvious I don't need all 50+ servers to run an RS, so maybe five of them? But having only five region servers on 50+ data nodes results in the HDFS data blocks being unevenly distributed across the cluster. This won't be an optimal solution.

So, in this particular case, I'd rather separate the RSs from the DNs to keep the data blocks evenly distributed. I'm not sure this will hurt random read performance, because the network latency of today's hardware (average 0.1 ms) is good enough compared to server-class 15,000 RPM hard drives (5 ms). The only drawback I can think of is network congestion when doing massive writes and scans, but my case doesn't involve such operations.

It was good to know that having fewer than five region servers is not a bad idea (as long as you have enough HDFS data nodes). Your and Jon's emails gave me some information about things to avoid, and one of my friends is evaluating RDBs as well.

Thanks,
Tatsuya

On 06/04/2010, at 9:37 AM, Todd Lipcon wrote:
> Hi Tatsuya,
>
> On Thu, Jun 3, 2010 at 5:06 PM, Tatsuya Kawano wrote:
>
>> Hello,
>>
>> I remember Jon was talking the other day about trying a single HBase
>> server with an existing HDFS cluster to serve map reduce (MR) results.
>> I wonder if this went well or not.
>>
>> A couple of friends in Tokyo are considering HBase to do a similar
>> thing. They want to serve MR results inside the clients' companies via
>> HBase. They both have existing MR/HDFS environments; one has a small
>> (< 10) and another has a large (> 50) cluster.
>>
>> They'll use the incremental loading to an existing table (HBASE-1923)
>> to add the MR results to the HBase table, and only a few users will
>> read and export (web CSV download) the results via HBase. So HBase will
>> be lightly loaded. They probably won't even need the high availability
>> (HA) option on HBase.
>>
>> So I'm thinking to recommend them to add just one server (non-HA) or
>> two servers (HA) to their Hadoop cluster, and run only HMaster and
>> Region Server processes on the server(s). The HBase cluster will
>> utilize the existing (small or large) HDFS cluster and ZooKeeper
>> ensemble.
>>
> If your "exported dataset" from the MR job is small enough to fit on one
> server, you can certainly use a single HBase RS plus the bulk load
> functionality. However, with a small dataset like that it might make more
> sense to simply export TSV/CSV and then use a tool like Sqoop to export
> to a relational database. That way you'd have better off the shelf
> integration with various other tools or access methods.
>
>> The server spec will be 2 x 8-core processors and 8GB to 24GB RAM. The
>> RAM size will change depending on the data volume and access pattern.
>>
>> Has anybody tried a similar configuration? And how did it go?
>>
>> Also, I saw Jon's slides for Hadoop World in NYC 2009, and it was said
>> that I'd better have at least 5 Region Servers / Data Nodes in my
>> cluster to get the typical performance. If I deploy RS and DN on
>> separate servers, which one should be >= 5 nodes? DN? RS? or both?
>>
> Better to colocate the DNs and RSs for most deployments. You get
> significantly better random read performance for uncached data.
>
> -Todd
>
>> Thanks,
>> Tatsuya Kawano
>> Tokyo, Japan
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
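P.S. For anyone finding this in the archive: the "one or two HBase-only servers on an existing HDFS cluster and ZooKeeper ensemble" setup discussed above mostly comes down to pointing hbase-site.xml on those servers at the existing services. A minimal sketch; the hostnames and ports are placeholders, not anyone's real cluster:

```xml
<!-- hbase-site.xml on the HBase-only server(s).
     namenode.example.com and zk1..zk3.example.com are made-up names. -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.com:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
```

You'd also want HBASE_MANAGES_ZK=false in hbase-env.sh so HBase uses the existing ensemble instead of starting its own.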