hadoop-common-dev mailing list archives

From James Kennedy <james.kenn...@troove.net>
Subject HBase put() improvement?
Date Wed, 06 Jun 2007 17:05:34 GMT
Hi,

I'm noticing that since the HClient/HRegionServer interface only allows 
for a per-column put(), there is a lot of RPC and some lease management 
overhead when writing large amounts of data. For example:

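        for (int i = 0; i < 10000; i++) {
            Text rowKey = new Text(i+"");
            // One RPC to open the row update and obtain a lock (lease).
            long lock = client.startUpdate(rowKey);
            // One RPC per column.
            client.put(lock, COL1, rowKey.getBytes());
            client.put(lock, COL2, someValue.getBytes());
            // One RPC to commit the row.
            client.commit(lock);
        }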

This code takes my machine (using a single HMaster/HRegionServer on 
local filesystem) approximately 13 seconds to execute. When I measure 
the execution time within HRegionServer.put(), the total time spent in 
put() is under 2 seconds. So it looks like there's definitely overhead in 
the RPC communication and serialization/deserialization between client 
and server.

Writing 10000 rows takes 10000 x (startUpdate=1 + #cols=2 + commit=1) = 
40000 RPC operations.
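
The arithmetic above can be sketched as a small helper. This is only a rough illustration: the class and method names are hypothetical (not part of HBase), and the timings are rounded to 13 s total and 2 s inside put(). Since put() actually took less than 2 seconds, the per-RPC figure it prints is a floor, and it also lumps in client-side loop time and server time in startUpdate() and commit().

```java
// Hypothetical helper for the arithmetic above; not part of HBase.
final class RpcCost {
    // startUpdate + one put() per column + commit, for every row.
    static long perColumnRpcs(long rows, int cols) {
        return rows * (1 + cols + 1);
    }

    // Average microseconds per RPC spent outside HRegionServer.put().
    static long overheadMicrosPerRpc(long totalMillis, long putMillis, long rpcs) {
        return (totalMillis - putMillis) * 1000 / rpcs;
    }

    public static void main(String[] args) {
        long rpcs = perColumnRpcs(10000, 2);
        System.out.println("RPCs: " + rpcs);
        // Assumes the rounded figures 13000 ms total and 2000 ms in put().
        System.out.println("overhead per RPC (us): "
                + overheadMicrosPerRpc(13000, 2000, rpcs));
    }
}
```

With those rounded figures, roughly 11 seconds spread over 40000 calls comes to about 275 microseconds per RPC outside put().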

What I'm thinking, and please tell me if I'm wrong or if this is already 
in the works, is that if I create a row-level put() method that submits 
a map of column values at once, I would reduce the 2 + (#cols) RPC 
operations to one single atomic row-write RPC as well as eliminate the 
small but noticeable overhead in lease creation, renewal, and cancellation.
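
A minimal sketch of the idea, using an in-memory stand-in for the server that counts every call as one RPC. Everything here is hypothetical: CountingServer, putRow(), and the column names are made up for illustration and are not the HClient/HRegionServer API. The sketch models only the number of calls, not leases, networking, or real timings.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a region server; each method call counts as one RPC.
final class CountingServer {
    long rpcCalls = 0;
    private long nextLock = 0;
    private final Map<Long, String> openRows = new HashMap<>();
    private final Map<Long, Map<String, byte[]>> pending = new HashMap<>();
    final Map<String, Map<String, byte[]>> store = new HashMap<>();

    long startUpdate(String row) {
        rpcCalls++;
        long lock = nextLock++;
        openRows.put(lock, row);
        pending.put(lock, new LinkedHashMap<>());
        return lock;
    }

    void put(long lock, String column, byte[] value) {
        rpcCalls++;
        pending.get(lock).put(column, value);
    }

    void commit(long lock) {
        rpcCalls++;
        String row = openRows.remove(lock);
        store.computeIfAbsent(row, k -> new HashMap<>()).putAll(pending.remove(lock));
    }

    // Proposed row-level write (hypothetical name): all columns in a single call.
    void putRow(String row, Map<String, byte[]> columns) {
        rpcCalls++;
        store.computeIfAbsent(row, k -> new HashMap<>()).putAll(columns);
    }
}

final class RowPutDemo {
    // Current style: startUpdate, one put per column, commit.
    static long perColumnWrites(CountingServer s, int rows) {
        for (int i = 0; i < rows; i++) {
            String row = Integer.toString(i);
            long lock = s.startUpdate(row);
            s.put(lock, "col1:", row.getBytes());
            s.put(lock, "col2:", "value".getBytes());
            s.commit(lock);
        }
        return s.rpcCalls;
    }

    // Proposed style: one call carrying a map of column values.
    static long rowLevelWrites(CountingServer s, int rows) {
        for (int i = 0; i < rows; i++) {
            String row = Integer.toString(i);
            Map<String, byte[]> cols = new LinkedHashMap<>();
            cols.put("col1:", row.getBytes());
            cols.put("col2:", "value".getBytes());
            s.putRow(row, cols);
        }
        return s.rpcCalls;
    }

    public static void main(String[] args) {
        System.out.println("per-column RPCs: " + perColumnWrites(new CountingServer(), 10000));
        System.out.println("row-level RPCs:  " + rowLevelWrites(new CountingServer(), 10000));
    }
}
```

For 10000 rows with two columns each, the per-column loop makes 40000 calls and the row-level loop makes 10000. Because the row-level call never hands a lock back to the client, there is also no lease to create, renew, or cancel between calls.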

It's not clear exactly what the performance improvement would be. The 
same amount of serialization/deserialization must occur, but YourKit 
profiling tells me that the serialization overhead is negligible.

Any thoughts?





