hadoop-common-dev mailing list archives

From "James Kennedy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1468) Add HBase batch update to reduce RPC overhead
Date Wed, 06 Jun 2007 17:41:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502003 ]

James Kennedy commented on HADOOP-1468:
---------------------------------------

The above batch update description implies batch update of columns for a single row.

Another possibility is batch update of multiple rows wherein the client buffers up a number
of row updates and flushes them out together.
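
A minimal sketch of the multi-row idea, assuming a client-side buffer that
collects row updates and hands them to the server in a single call. The names
RowUpdateBuffer, RowUpdate, BatchSender, and maxRows are hypothetical and are
not part of the HClient/HRegionServer interface; BatchSender only stands in
for whatever RPC would carry the batch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical client-side buffer; none of these names come from the HBase API. */
class RowUpdateBuffer {

    /** One pending row update: a row key plus its column values. */
    static final class RowUpdate {
        final String rowKey;
        final Map<String, byte[]> columns = new LinkedHashMap<>();

        RowUpdate(String rowKey) {
            this.rowKey = rowKey;
        }
    }

    /** Stand-in for the single RPC that would carry a whole batch to the server. */
    interface BatchSender {
        void send(List<RowUpdate> batch);
    }

    private final int maxRows;
    private final BatchSender sender;
    private final List<RowUpdate> pending = new ArrayList<>();

    RowUpdateBuffer(int maxRows, BatchSender sender) {
        if (maxRows < 1) {
            throw new IllegalArgumentException("maxRows must be >= 1");
        }
        this.maxRows = maxRows;
        this.sender = sender;
    }

    /** Queues one row; flushes automatically once maxRows rows are pending. */
    void addRow(String rowKey, Map<String, byte[]> columns) {
        RowUpdate update = new RowUpdate(rowKey);
        update.columns.putAll(columns);
        pending.add(update);
        if (pending.size() >= maxRows) {
            flush();
        }
    }

    /** Sends all pending rows in one call and clears the buffer. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sender.send(new ArrayList<>(pending));
        pending.clear();
    }
}
```

With maxRows set to 1000, writing 10000 rows would result in 10 calls to the
sender rather than one or more calls per row. The sketch leaves out locking,
atomicity across rows, and error handling, all of which a real implementation
would need to address.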

> Add HBase batch update to reduce RPC overhead
> ---------------------------------------------
>
>                 Key: HADOOP-1468
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1468
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>    Affects Versions: 0.14.0
>            Reporter: Jim Kellerman
>             Fix For: 0.14.0
>
>
> On Wed, 2007-06-06 at 10:05 -0700, James Kennedy wrote:
> > Hi,
> > 
> > I'm noticing that since the HClient/HRegionServer interface only allows 
> > for a per-column put(), there is a lot of RPC and some lease management 
> > overhead when writing large amounts of data. For example:
> > 
> >         for (int i = 0; i < 10000; i++) {
> >             Text rowKey = new Text(i+"");
> >             long lock = client.startUpdate(rowKey);
> >             client.put(lock, COL1, rowKey.getBytes());
> >             client.put(lock, COL2, someValue.getBytes());
> >             client.commit(lock);
> >         }
> > 
> > This code takes my machine (using a single HMaster/HRegionServer on 
> > local filesystem) approximately 13 seconds to execute. When I measure 
> > the execution time within HRegionServer.put() I get total time spent in 
> > put() < 2 seconds. So it looks like there's definitely overhead in the 
> > RPC communication and serialization/deserialization between client and 
> > server. 
> > 
> > To write 10000 rows, 10000 x (startUpdate=1 +  #cols=2 + commit=1) = 
> > 40000 RPC operations.
> > 
> > What I'm thinking, and please tell me if I'm wrong or if this is already 
> > in the works, is that if I create a row-level put() method that submits 
> > a map of column values at once, I would reduce the 2 + (#cols) RPC 
> > operations to one single atomic row-write RPC as well as eliminate the 
> > small but noticeable overhead in lease creation, renewal, and cancellation.
> > 
> > It's not clear exactly what the performance improvement would be. The 
> > same amount of serialization/deserialization must occur, but YourKit 
> > profiling tells me that the serialization overhead is negligible.
> > 
> > Any thoughts?
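
The RPC arithmetic in the quoted message can be checked with a short sketch.
It compares the per-column protocol (startUpdate, one put per column, commit)
against the proposed row-level put, assuming the latter costs exactly one RPC
per row. The class and method names are illustrative only and do not exist in
HBase.

```java
/** Back-of-the-envelope RPC counts; class and method names are illustrative only. */
class RpcCount {

    /** Per-column protocol: startUpdate + one put per column + commit, for each row. */
    static long perColumn(long rows, int cols) {
        return rows * (1 + cols + 1);
    }

    /** Hypothetical row-level put: assumed to cost one RPC per row. */
    static long perRow(long rows) {
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(perColumn(10000, 2)); // prints 40000
        System.out.println(perRow(10000));       // prints 10000
    }
}
```

Under that assumption the 10000-row, 2-column example drops from 40000 RPCs to
10000. This counts round trips only; it says nothing about how much of the
reported 13 seconds would actually be saved.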

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

