hadoop-common-dev mailing list archives

From Bryan Duxbury <br...@rapleaf.com>
Subject Re: CRC32 performance
Date Mon, 06 Oct 2008 23:53:47 GMT
I am profiling with YourKit on random reducers. I'm also running on  
HDFS, so I don't know how one would go about disabling CRCs.


On Oct 6, 2008, at 4:35 PM, Doug Cutting wrote:

> How are you profiling?  I don't trust most profilers.
> Have you tried, e.g., disabling checksums and seeing how much  
> performance is actually gained?  For the local filesystem, you can  
> easily disable checksums by binding file: URIs to  
> RawLocalFileSystem in your configuration.
>
> Doug
>
> Bryan Duxbury wrote:
>> Hey all,
>> I've been profiling our map/reduce applications quite a bit over  
>> the last few weeks to try and get some performance improvements in  
>> our jobs. I noticed an interesting bottleneck in Hadoop itself I  
>> thought I should bring up.
>> FSDataOutputStream appears to create a CRC of the data being  
>> written via FSOutputSummer.write1. It uses the built-in Java CRC32  
>> implementation to do so. However, out of a 41-second reducer main  
>> thread, this CRC call is taking up around 13 seconds, or about  
>> 32%. This appears to dwarf the actual writing time  
>> (FSOutputSummer.flushBuffer) which only takes 1.9s (5%). This  
>> seems like an incredibly large amount of overhead to pay.
>> To my surprise, there's already a faster checksum implementation in  
>> the Java standard library called Adler32 which is described as "almost  
>> as reliable as a CRC-32 but can be computed much faster". This  
>> sounds very attractive, indeed. Some quick tests indicate that  
>> Adler32 is about 3x as fast.
>> Is there any reason why CRC32 was chosen, or why Adler32 wouldn't  
>> be an acceptable checksum? I understand that Adler32 is weak for small  
>> messages (small as in hundreds of bytes), but since this is behind  
>> a buffered writer, the messages should all be thousands of bytes  
>> to begin with. Worst case, I guess we could select the checksum  
>> algorithm based on the size of the message, using CRC32 for small  
>> messages and Adler32 for bigger ones.
>> -Bryan
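
For anyone who wants to repeat the kind of "quick test" described above outside a profiler, here is a minimal Java sketch. It is not the code used in this thread. It times java.util.zip.CRC32 against java.util.zip.Adler32 on a 64 KB buffer and also sketches the size-based selection idea from the last paragraph. The class name ChecksumChoice, the helper names, the buffer size, and the 1024-byte cutoff are all assumptions made for illustration. Timings depend heavily on the JVM version and hardware, so treat the printed numbers as a rough indication only.

```java
import java.util.Random;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

class ChecksumChoice {
    // Hypothetical cutoff: the thread only says Adler32 is weak for
    // messages of "hundreds of bytes"; 1024 is an illustrative value.
    static final int SMALL_MESSAGE_BYTES = 1024;

    // Pick CRC32 for small messages and Adler32 for larger ones.
    static Checksum forLength(int length) {
        return length < SMALL_MESSAGE_BYTES ? new CRC32() : new Adler32();
    }

    // Compute the checksum of a whole byte array with the given algorithm.
    static long checksumOf(Checksum sum, byte[] data) {
        sum.reset();
        sum.update(data, 0, data.length);
        return sum.getValue();
    }

    // Time `rounds` checksum passes over `data`, in nanoseconds.
    static long nanosFor(Checksum sum, byte[] data, int rounds) {
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            checksumOf(sum, data);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        byte[] chunk = new byte[64 * 1024];
        new Random(42).nextBytes(chunk);
        int rounds = 2000;

        // Warm up both implementations so JIT compilation does not
        // dominate the measured runs.
        nanosFor(new CRC32(), chunk, rounds);
        nanosFor(new Adler32(), chunk, rounds);

        long crcNanos = nanosFor(new CRC32(), chunk, rounds);
        long adlerNanos = nanosFor(new Adler32(), chunk, rounds);

        System.out.printf("CRC32:   %d ms%n", crcNanos / 1_000_000);
        System.out.printf("Adler32: %d ms%n", adlerNanos / 1_000_000);
    }
}
```

Both classes implement java.util.zip.Checksum, so a caller can hold whichever instance forLength returns and use it the same way. A reader would also need to know which algorithm produced a stored checksum; the sketch does not address that.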
