hbase-user mailing list archives

From Chris Tarnas <...@email.com>
Subject Re: EC2 + Thrift inserts
Date Sat, 01 May 2010 00:23:28 GMT
Thanks, that would be great.

Actually the code is Perl; I'm using streaming to do the map-reduce (bioinformatics data that
we have lots of Perl libraries for). So far on a single thread it works quite well (in-house
we get ~300 rows/sec, on EC2 maybe half that with indexes), usually with the Perl being the
bottleneck and HBase just soaking it up. We were hoping to throw more CPUs at it to increase
the load speed.

-chris

On Apr 30, 2010, at 5:01 PM, Jean-Daniel Cryans wrote:

> Not sure why you are going through thrift if you are already using
> java (you want to test thrift's speed because java isn't your main dev
> language?) but it will maybe add 1ms or 2, really not that bad. Here
> at StumbleUpon we use thrift to get our php website to talk to HBase
> and on average we stay under 10ms for random gets. Our machines are
> 2xi7, 24GB, 4x1TB sata.
> 
> My coworker (Stack) pinged the author of the contrib to see if he can
> make a patch for your issue.
> 
> J-D
> 
> On Fri, Apr 30, 2010 at 4:51 PM, Chris Tarnas <cft@email.com> wrote:
>> 
>> On Apr 30, 2010, at 4:44 PM, Jean-Daniel Cryans wrote:
>> 
>>> On Fri, Apr 30, 2010 at 4:32 PM, Chris Tarnas <cft@email.com> wrote:
>>>> 
>>>> 
>>>> I'm also using thrift to connect and am wondering if that itself puts an
>>>> overall limit on scaling? It does seem that no matter how many more mappers
>>>> and servers I add, even without indexing, I am capped at about 5k rows/sec
>>>> total. I'm waiting a bit as the table grows so that it is split across more
>>>> regionservers, hopefully that will help, but as far as I can tell I am not
>>>> hitting any CPU or IO constraint during my tests.
>>> 
>>> I don't understand the "I'm also using thrift" and "how many more
>>> mappers" part, you are using Thrift inside a map? Anyways, more
>>> clients won't help since there's a single mega serialization of all
>>> the inserts to the index table per region server. It's normal not to
>>> see any CPU/mem/IO contention since, in this case, it's all about the
>>> speed at which you can process a single row insertion. The rest of the
>>> threads just wait...
>>> 
>> 
>> Sorry - should have been more clear. I'm testing now with normal tables and
>> regionservers and I seem to cap out at about 5-7k rows a second for inserts.
>> My method for doing inserts is to use map reduce on hadoop to launch many
>> insert processes; each process uses the local thrift server on each node to
>> connect to hbase. In this case I hope that other threads can insert at the
>> same time.
>> 
>> -chris
>> 
>> 
>> 
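
The point made in the thread, that adding clients stops helping once inserts
are serialized per region server, can be illustrated with a small
back-of-the-envelope model. This is only a sketch: the function name, the
number of region servers, and the serialized per-server rate are hypothetical
values chosen for illustration; only the ~300 rows/sec single-process figure
comes from the thread.

```python
def max_insert_rate(clients, per_client_rate, region_servers, serial_rate_per_server):
    """Estimate aggregate insert throughput in rows/sec.

    clients                -- number of concurrent insert processes
    per_client_rate        -- rows/sec one process can produce on its own
    region_servers         -- number of region servers taking writes (assumed)
    serial_rate_per_server -- rows/sec one server can absorb when inserts are
                              serialized behind a single lock (assumed)
    """
    offered = clients * per_client_rate                 # what the clients could push
    ceiling = region_servers * serial_rate_per_server   # what the servers can accept
    return min(offered, ceiling)


if __name__ == "__main__":
    # 300 rows/sec per process is the figure quoted in the thread;
    # 4 servers at 1250 rows/sec each are made-up numbers.
    for clients in (1, 4, 16, 64):
        print(clients, max_insert_rate(clients, 300, 4, 1250))
```

Under these assumed numbers the total grows linearly with the number of
clients until the server-side ceiling is reached, after which extra mappers
only add waiting threads.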

