Subject: Re: HBase Random Read Performance
From: Joost Ouwerkerk
To: hbase-user@hadoop.apache.org
Date: Fri, 8 Feb 2008 01:52:33 -0500

Current setup is three machines, one of which doubles as the master, on a distributed HDFS.

One million rows / 1 column was just a test -- we definitely need to scale well beyond that, at which point MySQL breaks down as a viable option. Besides the appeal of MapReduce for offline processing, multi-column access is definitely a requirement, and an obvious next step for benchmarking.

I'm now looking at how to bulk-load data properly: it took hours to load 1 million rows from a client doing a lock/put/commit cycle for every row, whereas PerformanceEvaluation can do the same in about 15 minutes with a single client.
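For concreteness, my load loop is essentially the sketch below ("mytable", the "content:serialized" column, and the 1K value are placeholders standing in for our real table and serialized objects):

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HTable;
    import org.apache.hadoop.io.Text;

    public class NaiveLoader {
      public static void main(String[] args) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), new Text("mytable"));
        byte[] value = new byte[1024];  // stand-in for one serialized object
        for (int i = 0; i < 1000000; i++) {
          // One lock/put/commit cycle -- at least one round trip to the
          // region server -- for every single row. This is the slow path.
          long lockid = table.startUpdate(new Text(String.format("row%07d", i)));
          table.put(lockid, new Text("content:serialized"), value);
          table.commit(lockid);
        }
      }
    }

As far as I can tell, PerformanceEvaluation's sequentialWrite issues the same startUpdate/put/commit sequence per row, so I assume the 15-minute figure comes from everything around the loop rather than from a different write path.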
BTW, running PerformanceEvaluation randomRead with 5 clients (MR) I get 1,687 reads/sec, if I'm reading the results correctly (Row count 5,242,850 / Elapsed time 3,107.6 s ≈ 1,687):

08/02/08 01:14:40 INFO mapred.JobClient: Job complete: job_200802042127_0001
08/02/08 01:14:40 INFO mapred.JobClient: Counters: 12
08/02/08 01:14:40 INFO mapred.JobClient:   HBase Performance Evaluation
08/02/08 01:14:40 INFO mapred.JobClient:     Elapsed time in milliseconds=3107646
08/02/08 01:14:40 INFO mapred.JobClient:     Row count=5242850
08/02/08 01:14:40 INFO mapred.JobClient:   Job Counters
08/02/08 01:14:40 INFO mapred.JobClient:     Launched map tasks=54
08/02/08 01:14:40 INFO mapred.JobClient:     Launched reduce tasks=1
08/02/08 01:14:40 INFO mapred.JobClient:     Data-local map tasks=51
08/02/08 01:14:40 INFO mapred.JobClient:   Map-Reduce Framework
08/02/08 01:14:40 INFO mapred.JobClient:     Map input records=50
08/02/08 01:14:40 INFO mapred.JobClient:     Map output records=50
08/02/08 01:14:40 INFO mapred.JobClient:     Map input bytes=3634
08/02/08 01:14:40 INFO mapred.JobClient:     Map output bytes=700
08/02/08 01:14:40 INFO mapred.JobClient:     Reduce input groups=50
08/02/08 01:14:40 INFO mapred.JobClient:     Reduce input records=50
08/02/08 01:14:40 INFO mapred.JobClient:     Reduce output records=50

Joost.

On 8-Feb-08, at 12:05 AM, stack wrote:

> The test described can only favor MySQL (single column, just a
> million rows). Do you need HBase?
> You might also tell us more about your HBase setup. Is it using
> localfs or HDFS? Is it a distributed HDFS or all on a single server?
>
> Thanks,
> St.Ack
>
> Joost Ouwerkerk wrote:
>> I'm working on a web application with primarily read-oriented
>> performance requirements. I've been running some benchmarking
>> tests that include our application layer, to get a sense of what is
>> possible with HBase. In a variation on the Bigtable test reproduced
>> by org.apache.hadoop.hbase.PerformanceEvaluation, I'm randomly
>> reading 1 column from a table with 1 million rows. In our case, the
>> contents of that column need to be deserialized by our application
>> (which adds some overhead that I'm also trying to measure); the
>> deserialized contents represent a little over 1K of data.
>>
>> Although a single thread can only achieve 125 reads per second,
>> with 12 client threads (from 3 different machines) I'm able to read
>> as many as 500 objects per second. I've now replicated my test on a
>> basic MySQL table and am able to get a throughput of 2,300
>> reads/sec, roughly 5 times what I'm seeing with HBase. Besides the
>> obvious code-maturity issue, is the discrepancy related to HBase
>> serving random reads from disk rather than from memcache? The HBase
>> performance page
>> (http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation)
>> shows random reads (mem) as "Not implemented."
>>
>> Can anyone shed some light on the state of HBase's memcaching?
>>
>> Cheers,
>> Joost.
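P.S. In case anyone wants to reproduce the numbers above: if I'm reading the PerformanceEvaluation usage correctly, the run amounts to something like

    bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation randomRead 5

with the last argument being the number of clients (5 here, hence the MapReduce job).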