hbase-user mailing list archives

From "Hegner, Travis" <THeg...@trilliumit.com>
Subject RE: Is the thrift server a likely bottleneck?
Date Thu, 03 Sep 2009 14:49:23 GMT
Hi All,

I've used thrift from PHP, and have done bulk imports (multiple rows in one write). Here is
some pseudo code:

<?php

// Three column mutations that will be applied to each row below.
// Note: the array keys must be quoted strings ('column', 'value') --
// a bare value is an undefined constant in PHP.
$mutations = array(
  new Mutation(array('column' => 'cf:q1', 'value' => 'v1')),
  new Mutation(array('column' => 'cf:q2', 'value' => 'v2')),
  new Mutation(array('column' => 'cf:q3', 'value' => 'v3'))
);

// One BatchMutation per row, each carrying its set of mutations.
$batch[0] = new BatchMutation(array('row' => 'r1', 'mutations' => $mutations));
$batch[1] = new BatchMutation(array('row' => 'r2', 'mutations' => $mutations));
$batch[2] = new BatchMutation(array('row' => 'r3', 'mutations' => $mutations));

// Apply all three rows in a single mutateRows call.
$client->mutateRows('Table', $batch);

?>

This will insert three rows, with three columns each, in (what I assume is) one connection
and one 'batch' upload. My testing with 130k large rows has given me no reason to believe
that assumption is wrong.
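For readers coming from Python (mentioned later in this thread), the batch has the same shape there. The classes below are simplified stand-ins for the thrift-generated Mutation and BatchMutation types, just to show how the batch is structured; a real client would import the generated hbase module instead.

```python
# Simplified stand-ins for the thrift-generated Mutation/BatchMutation
# classes (the real ones come from the generated `hbase` module).
class Mutation:
    def __init__(self, column, value):
        self.column = column
        self.value = value

class BatchMutation:
    def __init__(self, row, mutations):
        self.row = row
        self.mutations = mutations

# Three columns, applied to each of three rows -- mirroring the PHP above.
mutations = [Mutation('cf:q1', 'v1'),
             Mutation('cf:q2', 'v2'),
             Mutation('cf:q3', 'v3')]

batch = [BatchMutation(row, mutations) for row in ('r1', 'r2', 'r3')]

# With a real connected client this would then be a single call:
# client.mutateRows('Table', batch)
```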

To answer the bottleneck question: I believe thrift would eventually become a bottleneck
under a heavy enough load. A dedicated thrift server would help. A rudimentary solution
would be to launch thrift on each region server and do a simple DNS round robin, but
that won't work as well as Ryan's suggestion. With either approach you're still not guaranteed
to connect to the thrift server that houses the requested data locally; I would imagine
thrift would require significant work to provide that functionality.
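As a rough illustration, the same round-robin idea can also be done on the client side rather than in DNS. This is only a sketch; the hostnames are placeholders, not servers from this thread.

```python
import itertools

# Placeholder hostnames for thrift servers co-located with region servers
# (these are illustrative, not real machines).
THRIFT_HOSTS = ["rs1.example.com", "rs2.example.com", "rs3.example.com"]

_cycle = itertools.cycle(THRIFT_HOSTS)

def next_thrift_host():
    """Pick the next thrift host in simple round-robin order.

    Note: this only spreads request load evenly; it does NOT route a
    request to the server that actually holds the requested row locally.
    """
    return next(_cycle)
```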

Another possibility (the one I used) is to run thrift on a machine outside of the cluster
(ideally, on any/all machines making thrift requests). The thrift requests can then always
point to 'localhost', and you access the cluster almost exactly as if you had coded against
the native Java HBase client. I just copied my entire HBase software and config onto
my workstation, launched thrift, and configured my PHP script to connect to localhost instead
of the original cluster-housed thrift server.

In a web environment, you could just run thrift on all of your web servers; web requests
would then be proxied through the local thrift instance into HBase, so the thrift load/capacity
would scale exactly with your web servers.
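A minimal sketch of that per-webserver wiring, assuming the thrift server's default listening port of 9090; the connection itself is elided, since a real client would build a thrift transport and protocol around this endpoint.

```python
DEFAULT_THRIFT_PORT = 9090  # default listening port of the HBase thrift server

def local_thrift_endpoint(port=DEFAULT_THRIFT_PORT):
    """Endpoint a web process uses when thrift runs on every web host.

    Because a thrift server is co-located with each web server, every
    client can always point at localhost, and thrift capacity scales
    exactly with the web tier.
    """
    return ("localhost", port)
```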

Just a thought, Hope this helps,

Travis Hegner
http://www.travishegner.com/

-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Thursday, September 03, 2009 7:24 AM
To: hbase-user@hadoop.apache.org; sh@defuze.org
Subject: Re: Is the thrift server a likely bottleneck?

Sylvain,

A BatchMutation is for a single row and multiple columns (for that
row) so in the HBase Thrift API you cannot batch insert many rows. In
the Java API the equivalent to BatchMutation is Put (which before was
named BatchInsert but people got confused, just like now).

J-D

On Thu, Sep 3, 2009 at 4:25 AM, Sylvain Hellegouarch<sh@defuze.org> wrote:
>
>> Thrift spawns as many threads as requests, so running more than one
>> shouldn't benefit you much I think?
>
> Being a little unaware of Java's cleverness with threads, I cannot really
> say, but you're probably right.
>
>>
>> I run 1 thriftserver per regionserver, co existing, and then use
>> TSocketPool on the client side to spread load around.
>>
>> But generally, YES, the thrift server could be a bottleneck.  The main
>> problem with thrift and performance is you cannot control the scanner
>> caching directly, and you cannot do bulk commits.  Both of those would
>> require some API changes, and while not impossible, just hasn't been
>> prioritized.
>
> I'm a little confused then as to what the difference is between the bulk
> commit you mention and the batch mutation support in the thrift interface.
>
> Moreover, the HBase 0.20 API is a bit unclear as to when the commit
> happens when using Put. In fact, I'm a little unclear as to the best
> practice for writing lots of rows as efficiently as possible. One by
> one? Batch mutations?
>
>>
>> Personally, we use thrift for php scripts, and use the Java API for
>> map-reduces and bulk data operations. Thus achieving the best of both
>> worlds: cross language access from PHP and the faster Java-based API
>> for certain scenarios.
>
> We will probably be using Pig Latin for the M/R, with a Java adapter to
> fetch rows from HBase. However, we do use Python for writing, and I'm
> willing to use Jython, but that would probably create other dependency
> issues that I'd be happy to avoid if Thrift is good enough :)
>
> Thanks,
> - Sylvain
>
>
> --
> Sylvain Hellegouarch
> http://www.defuze.org
>

The information contained in this communication is confidential and is intended only for the
use of the named recipient.  Unauthorized use, disclosure, or copying is strictly prohibited
and may be unlawful.  If you have received this communication in error, you should know that
you are bound to confidentiality, and should please immediately notify the sender or our IT
Department at  866.459.4599.
