hadoop-common-dev mailing list archives

From "Jim Kellerman (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-1468) Add HBase batch update to reduce RPC overhead
Date Wed, 06 Jun 2007 17:14:25 GMT
Add HBase batch update to reduce RPC overhead
---------------------------------------------

                 Key: HADOOP-1468
                 URL: https://issues.apache.org/jira/browse/HADOOP-1468
             Project: Hadoop
          Issue Type: New Feature
          Components: contrib/hbase
    Affects Versions: 0.14.0
            Reporter: Jim Kellerman
             Fix For: 0.14.0


On Wed, 2007-06-06 at 10:05 -0700, James Kennedy wrote:
> Hi,
> 
> I'm noticing that since the HClient/HRegionServer interface only allows 
> for a per-column put(), there is a lot of RPC and some lease management 
> overhead when writing large amounts of data. For example:
> 
>         for (int i = 0; i < 10000; i++) {
>             Text rowKey = new Text(i+"");
>             long lock = client.startUpdate(rowKey);
>             client.put(lock, COL1, rowKey.getBytes());
>             client.put(lock, COL2, someValue.getBytes());
>             client.commit(lock);
>         }
> 
> This code takes my machine (using a single HMaster/HRegionServer on 
> local filesystem) approximately 13 seconds to execute. When I measure
> the execution time within HRegionServer.put() I get total time spent in
> put() < 2 seconds. So it looks like there's definitely overhead in the
> RPC communication and serialization/deserialization between client and 
> server. 
> 
> To write 10000 rows, 10000 x (startUpdate=1 +  #cols=2 + commit=1) = 
> 40000 RPC operations.
> 
> What I'm thinking, and please tell me if I'm wrong or if this is already
> in the works, is that if I create a row-level put() method that submits 
> a map of column values at once, I would reduce the 2 + (#cols) RPC 
> operations to one single atomic row-write RPC as well as eliminate the 
> small but noticeable overhead in lease creation, renewal, and cancellation.
> 
> It's not clear exactly what the performance improvement would be. The 
> same amount of serialization/deserialization must occur, but YourKit
> profiling tells me that the serialization overhead is negligible.
> 
> Any thoughts?

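The RPC arithmetic in the quoted message can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: FakeRegionServer, putRow(), and the column names "col1:" and "col2:" are hypothetical and are not part of the HClient/HRegionServer interface discussed above. Each method call on the fake server stands in for one RPC round trip, so the counter shows how a row-level write would reduce 2 + (#cols) calls per row to one.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class BatchUpdateSketch {

    /** Hypothetical in-memory stand-in for a region server; every method call models one RPC. */
    static class FakeRegionServer {
        int rpcCount = 0;
        final Map<String, Map<String, byte[]>> rows = new HashMap<>();
        private final Map<Long, String> lockedRows = new HashMap<>();
        private final Map<Long, Map<String, byte[]>> pending = new HashMap<>();
        private long nextLock = 0;

        // Per-column protocol: start an update, put each column, then commit.
        long startUpdate(String row) {
            rpcCount++;
            long lock = nextLock++;
            lockedRows.put(lock, row);
            pending.put(lock, new LinkedHashMap<>());
            return lock;
        }

        void put(long lock, String column, byte[] value) {
            rpcCount++;
            pending.get(lock).put(column, value);
        }

        void commit(long lock) {
            rpcCount++;
            rows.put(lockedRows.remove(lock), pending.remove(lock));
        }

        // Assumed row-level write: all column values for one row arrive in a single call.
        void putRow(String row, Map<String, byte[]> columns) {
            rpcCount++;
            rows.put(row, new LinkedHashMap<>(columns));
        }
    }

    /** Writes rowCount rows with two columns each using startUpdate/put/put/commit. */
    static void writePerColumn(FakeRegionServer server, int rowCount) {
        for (int i = 0; i < rowCount; i++) {
            String rowKey = Integer.toString(i);
            byte[] value = rowKey.getBytes(StandardCharsets.UTF_8);
            long lock = server.startUpdate(rowKey);
            server.put(lock, "col1:", value);
            server.put(lock, "col2:", value);
            server.commit(lock);
        }
    }

    /** Writes the same rows, sending each row's columns as one map. */
    static void writeBatched(FakeRegionServer server, int rowCount) {
        for (int i = 0; i < rowCount; i++) {
            String rowKey = Integer.toString(i);
            byte[] value = rowKey.getBytes(StandardCharsets.UTF_8);
            Map<String, byte[]> columns = new LinkedHashMap<>();
            columns.put("col1:", value);
            columns.put("col2:", value);
            server.putRow(rowKey, columns);
        }
    }

    public static void main(String[] args) {
        FakeRegionServer perColumn = new FakeRegionServer();
        writePerColumn(perColumn, 10000);
        FakeRegionServer batched = new FakeRegionServer();
        writeBatched(batched, 10000);
        System.out.println("per-column RPCs: " + perColumn.rpcCount); // 40000
        System.out.println("batched RPCs:    " + batched.rpcCount);   // 10000
    }
}
```

The simulation only counts calls. It says nothing about real latency, lease handling, or serialization cost, which would have to be measured against an actual HRegionServer.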

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

