Subject: Re: HBase Random Read Performance
From: Joost Ouwerkerk
To: hbase-user@hadoop.apache.org
Date: Fri, 8 Feb 2008 01:52:33 -0500

Current setup is three machines, one of which doubles as the master, on a distributed HDFS.

One million rows / 1 column was just a test -- we definitely need to scale well beyond that, at which point MySQL breaks down as a viable option. Besides the appeal of MapReduce for offline processing, multi-column access is definitely a requirement, and an obvious next step for benchmarking.

I'm now looking at how to bulk-load data properly: it took hours to load 1 million rows from a client doing a lock/put/commit cycle for every row, whereas PerformanceEvaluation can do the same in about 15 minutes with a single client.
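For concreteness, my load loop is essentially the sketch below ("mytable", the "content:serialized" column, and the 1K value are placeholders standing in for our real table and serialized objects):

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HTable;
    import org.apache.hadoop.io.Text;

    public class NaiveLoader {
      public static void main(String[] args) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), new Text("mytable"));
        byte[] value = new byte[1024];  // stand-in for one serialized object
        for (int i = 0; i < 1000000; i++) {
          // One lock/put/commit cycle -- at least one round trip to the
          // region server -- for every single row. This is the slow path.
          long lockid = table.startUpdate(new Text(String.format("row%07d", i)));
          table.put(lockid, new Text("content:serialized"), value);
          table.commit(lockid);
        }
      }
    }

As far as I can tell, PerformanceEvaluation's sequentialWrite issues the same startUpdate/put/commit sequence per row, so I assume the 15-minute figure comes from everything around the loop rather than from a different write path.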
BTW, running PerformanceEvaluation randomRead with 5 clients (MR) I get 1,687 reads/sec, if I'm reading the results correctly (Row count 5,242,850 / Elapsed time 3,107.6 s ≈ 1,687):

08/02/08 01:14:40 INFO mapred.JobClient: Job complete: job_200802042127_0001
08/02/08 01:14:40 INFO mapred.JobClient: Counters: 12
08/02/08 01:14:40 INFO mapred.JobClient:   HBase Performance Evaluation
08/02/08 01:14:40 INFO mapred.JobClient:     Elapsed time in milliseconds=3107646
08/02/08 01:14:40 INFO mapred.JobClient:     Row count=5242850
08/02/08 01:14:40 INFO mapred.JobClient:   Job Counters
08/02/08 01:14:40 INFO mapred.JobClient:     Launched map tasks=54
08/02/08 01:14:40 INFO mapred.JobClient:     Launched reduce tasks=1
08/02/08 01:14:40 INFO mapred.JobClient:     Data-local map tasks=51
08/02/08 01:14:40 INFO mapred.JobClient:   Map-Reduce Framework
08/02/08 01:14:40 INFO mapred.JobClient:     Map input records=50
08/02/08 01:14:40 INFO mapred.JobClient:     Map output records=50
08/02/08 01:14:40 INFO mapred.JobClient:     Map input bytes=3634
08/02/08 01:14:40 INFO mapred.JobClient:     Map output bytes=700
08/02/08 01:14:40 INFO mapred.JobClient:     Reduce input groups=50
08/02/08 01:14:40 INFO mapred.JobClient:     Reduce input records=50
08/02/08 01:14:40 INFO mapred.JobClient:     Reduce output records=50

Joost.

On 8-Feb-08, at 12:05 AM, stack wrote:

> The test described can only favor MySQL (single column, just a
> million rows). Do you need HBase?
> You might also tell us more about your HBase setup. Is it using
> localfs or HDFS? Is it a distributed HDFS or all on a single server?
>
> Thanks,
> St.Ack
>
> Joost Ouwerkerk wrote:
>> I'm working on a web application with primarily read-oriented
>> performance requirements. I've been running some benchmarking
>> tests that include our application layer, to get a sense of what is
>> possible with HBase. In a variation on the Bigtable test reproduced
>> by org.apache.hadoop.hbase.PerformanceEvaluation, I'm randomly
>> reading 1 column from a table with 1 million rows. In our case, the
>> contents of that column need to be deserialized by our application
>> (which adds some overhead that I'm also trying to measure); the
>> deserialized contents represent a little over 1K of data.
>>
>> Although a single thread can only achieve 125 reads per second,
>> with 12 client threads (from 3 different machines) I'm able to read
>> as many as 500 objects per second. I've now replicated my test on a
>> basic MySQL table and am able to get a throughput of 2,300
>> reads/sec, roughly 5 times what I'm seeing with HBase. Besides the
>> obvious code-maturity issue, is the discrepancy related to HBase
>> serving random reads from disk rather than from memcache? The HBase
>> performance page
>> (http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation)
>> shows random reads (mem) as "Not implemented."
>>
>> Can anyone shed some light on the state of HBase's memcaching?
>>
>> Cheers,
>> Joost.
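P.S. In case anyone wants to reproduce the numbers above: if I'm reading the PerformanceEvaluation usage correctly, the run amounts to something like

    bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation randomRead 5

with the last argument being the number of clients (5 here, hence the MapReduce job).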