Subject: Re: comparing hbase backed by HDFS verses S3
From: Chris K Wensel
To: hbase-user@hadoop.apache.org
Date: Wed, 30 Apr 2008 17:30:57 -0700

Anything relating to S3 will be slower, so it probably shouldn't be used
as the default FileSystem for Hadoop.

It works great if you need to park data between cluster runs, assuming
you do not need applications external to Hadoop and the cluster to be
able to read the data, since data in S3FS is stuffed into S3 as blocks
(similar to HDFS).

Further, once support for appends is added to Hadoop/HDFS, I am unsure
whether it will be inherited by S3FS. I think this is a critical issue
for HBase.

Assuming you aren't expecting this cluster to live forever, maybe you
should keep your authoritative data on S3 (native or S3FS) and just
reload HBase on cluster init?

ckw

Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/

On Apr 30, 2008, at 1:02 PM, Clint Morgan wrote:

> We are considering using S3 as the DFS impl for hbase. I ran some
> benchmarks to get an idea of the performance differences. We are
> particularly interested in being able to serve data to users from
> hbase, so we want low-latency responses for getting tens of rows.
>
> Each row ("transaction") has about 1K worth of data in about 5 columns
> across two families. I'm using HBASE-605 to maintain a secondary index
> on the transaction amount. There is also a "relation" to a customer
> table, so some reads will also do a get from this other table.
>
> First I ran hbase backed by HDFS. Everything was run on EC2 small
> nodes: 1 node for the NameNode, 1 node for the DataNode, 1 node with
> the Master and RegionServer, and 1 node to load/read data from.
>
> Adding 50K transactions: [56610.166]ms
> Find all transactions: [35388.601]ms
> FindAll page 1: [125.058]ms (PageSize is 10)
> FindAll page 11: [71.89]ms
> FindAll page 51: [145.54]ms
> FindAll page 61: [268.486]ms
>
> FindAll sorted page 1: [139.881]ms
> FindAll sorted page 11: [1521.655]ms
> FindAll sorted page 21: [2729.641]ms
> FindAll sorted page 31: [3035.18]ms
>
> Then I ran hbase backed by S3, with everything else the same:
>
> Adding 50K transactions: [104826.437]ms
> Find all transactions: [51622.039]ms
> FindAll page 1: [5694.974]ms
> FindAll page 11: [4878.234]ms
> FindAll page 51: [5743.882]ms
> FindAll page 61: [4167.133]ms
>
> FindAll sorted page 1: [18535.306]ms
> The other sorted finds then timed out on the RPC call.
>
> So to summarize:
> loading data is almost twice as slow,
> a long scan is about 1.5 times slower,
> short scans are over an order of magnitude slower,
> and random reads (done on the sorted "scan") are over 2 orders of
> magnitude slower.
>
> Do these results sound reasonable? Is S3 really that costly compared
> to HDFS? Thanks for your input.
> -clint
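
A quick check of Clint's summary ratios, worked out from the timings
quoted above (S3 time divided by HDFS time, rounded):

  load 50K rows:       104826.437 / 56610.166 ~ 1.9x
  full scan:            51622.039 / 35388.601 ~ 1.5x
  short scan, page 1:    5694.974 / 125.058   ~ 46x  (other pages land
                                                      between roughly 16x and 68x)
  sorted page 1:        18535.306 / 139.881   ~ 133x

which agrees with the "almost twice", "about 1.5 times", "over an order
of magnitude", and "over 2 orders of magnitude" characterizations.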
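
On Chris's park-and-reload suggestion, here is a minimal sketch of what
that could look like with Hadoop's stock FileSystem API, assuming the
block-based s3:// filesystem. The bucket name, HDFS address, paths, and
credentials are placeholders, and for any sizeable data set the usual
route would be hadoop distcp rather than a one-off program like this.

// Sketch only: pull a parked copy of the HBase root directory from S3
// into HDFS at cluster init. All names below are made-up placeholders.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ReloadHBaseFromS3 {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Credentials for the block-based s3:// filesystem (placeholders).
    conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY");
    conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY");

    // Parked (authoritative) copy on S3 and the live location on HDFS.
    FileSystem s3 = FileSystem.get(URI.create("s3://my-hbase-park/"), conf);
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

    Path parked = new Path("s3://my-hbase-park/hbase");
    Path live = new Path("hdfs://namenode:9000/hbase");

    // Copy S3 -> HDFS; 'false' keeps the S3 copy in place.
    FileUtil.copy(s3, parked, hdfs, live, false, conf);

    // Before tearing the cluster down, the same call with source and
    // destination swapped pushes the authoritative copy back to S3.
  }
}

HBase itself would then point its root directory at the hdfs:// location,
rather than at S3 directly as in the benchmark above.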