hadoop-hdfs-user mailing list archives

From Eric Sammer <e...@lifeless.net>
Subject Re: Exponential performance decay when inserting large number of blocks
Date Thu, 14 Jan 2010 02:59:13 GMT
On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
> Alex, Dhruba
> I repeated the experiment increasing the block size to 32k.  Still doing
> 8 inserts in parallel, file size now is 512 MB; 11 datanodes.  I was
> also running iostat on one of the datanodes.  Did not notice anything
> that would explain an exponential slowdown.  There was more activity
> while the inserts were active but far from the limits of the disk system.

While creating many blocks, could it be that replication pipelining
is eating up the available handler threads on the datanodes? By
increasing the block size you would see better performance because the
system spends more time writing data to local disk and less time dealing
with things like replication "overhead." At a small block size, I could
imagine you're artificially creating a situation where you saturate the
default-sized thread pools, or something weird like that.
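If thread pool saturation is the culprit, the knobs to look at would be the
datanode handler and transceiver counts in hdfs-site.xml. These values are
illustrative guesses, not tuned recommendations:

```xml
<!-- hdfs-site.xml: illustrative values only, not recommendations -->
<property>
  <name>dfs.datanode.handler.count</name>
  <!-- RPC handler threads per datanode; the stock default is small (3) -->
  <value>10</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <!-- concurrent block transfer threads (note the historical misspelling) -->
  <value>1024</value>
</property>
```

Bumping these at least rules saturation in or out before digging deeper.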

If you're doing 8 inserts in parallel from one machine against 11 nodes,
this seems unlikely, but it might be worth looking into. The question is
whether testing with an artificially small block size like this is even a
viable test. At some point the overhead of talking to the namenode,
selecting datanodes for a block, and setting up replication pipelines
could become an abnormally high percentage of the run time.
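To put rough numbers on that per-block overhead, here's a quick
back-of-envelope sketch. The file size, writer count, and 32k block size
come from the thread; 64 MB is the stock default block size:

```python
# Back-of-envelope: how many block allocations (namenode round trips plus
# pipeline setups) the test above generates at each block size.

FILE_SIZE = 512 * 1024 * 1024   # 512 MB per file, as in the experiment
WRITERS = 8                     # parallel inserts

def blocks_per_file(block_size):
    """Number of blocks -- and thus pipeline setups -- one file requires."""
    return FILE_SIZE // block_size

small = blocks_per_file(32 * 1024)            # 32 KB blocks
default = blocks_per_file(64 * 1024 * 1024)   # stock 64 MB blocks

print(small)            # 16384 allocations per file at 32 KB
print(default)          # 8 allocations per file at 64 MB
print(small * WRITERS)  # 131072 pipeline setups across all 8 writers
```

Three orders of magnitude more block setups per file, so even a small
fixed cost per block would dominate the run time at 32k.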

Also, I wonder if the cluster is trying to rebalance blocks toward the
end of your runtime (if the balancer daemon is running) and this is
causing additional shuffling of data.
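If the balancer turns out to be running, its bandwidth cap would bound how
much shuffling it can cause. Assuming a stock 0.20-era config, the setting
looks like this (the value shown is the default, for reference):

```xml
<!-- hdfs-site.xml: caps bandwidth each datanode may spend on rebalancing -->
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <!-- bytes per second; 1048576 = 1 MB/s is the stock default -->
  <value>1048576</value>
</property>
```

If iostat on the datanodes showed nothing near that rate, the balancer is
probably not the explanation.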

Just throwing ideas out there. I don't know if this is reasonable at
all. I've never tested with a small block size like that and I don't
know the exact amount of overhead in some of these bits of the code.

Eric Sammer
