hadoop-common-user mailing list archives

From Jason Venner <ja...@attributor.com>
Subject Re: Question on Critical Region size for SequenceFile next/write - 0.15.1
Date Wed, 12 Dec 2007 22:07:07 GMT
We have been monitoring the performance of our jobs using slaves.sh vmstat 5.

When we are running the very simple mappers, which basically read input,
do very little, and write output, neither the CPU nor the disk is
being fully utilized. We expect to saturate on either CPU or disk. It
may be that we are saturating on network, but our network read speed is
about the same as our disk read speed, ~50mb/sec.
We only see about 1/5 of the disk bandwidth and 1/5 of the cpu being 
utilized, and increasing the number of threads doesn't change the 

Our theory is that the serialization time (not the disk write time) and
the deserialization time (not the disk read time) are the bottleneck.
I have some test code nearly ready to go; if it changes the machine
utilization on my standard job, I will let you know...
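
To make the theory concrete, here is a minimal sketch of the idea of
shrinking the critical region. It is not the test code mentioned above
and it is not Hadoop's SequenceFile API; the class NarrowLockWriter and
its methods append and recordCount are hypothetical names, and plain
String keys and values stand in for real Writables. Each thread
serializes its record into a private buffer outside the lock, and the
lock is held only while the finished bytes are appended to the shared
stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical illustration only; this is NOT Hadoop's SequenceFile.Writer.
class NarrowLockWriter {
    private final OutputStream out;            // shared underlying stream
    private final Object lock = new Object();  // guards out and records
    private long records = 0;

    NarrowLockWriter(OutputStream out) {
        this.out = out;
    }

    void append(String key, String value) throws IOException {
        // Serialization is CPU-bound, so do it outside the lock in a
        // per-call buffer; several threads can run this part in parallel.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream data = new DataOutputStream(buf);
        data.writeUTF(key);
        data.writeUTF(value);
        data.flush();
        byte[] record = buf.toByteArray();

        // Critical region: only the append of already-serialized bytes.
        synchronized (lock) {
            out.write(record);
            records++;
        }
    }

    long recordCount() {
        synchronized (lock) {
            return records;
        }
    }
}
```

Records from different threads may interleave in any order, but each
record is written whole because the append happens under the lock. The
sketch ignores sync markers, compression, and the other bookkeeping a
real SequenceFile writer does, so it shows only the locking pattern.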

Doug Cutting wrote:
> Jason Venner wrote:
>> On investigating, we discovered that the entirety of the 
>> next(key,value) and the entirety of the write( key, value) are 
>> synchronized on the file object.
>> This causes all threads to back up on the serialization/deserialization.
> I'm not sure what you want to happen here.  If you've got a bunch of 
> threads writing to a single file, and that's your performance 
> bottleneck, I don't see how to improve the situation except to write 
> to multiple files on different drives, or to spread your load across a 
> larger cluster (another way to get more drives).
> Doug
