hadoop-common-user mailing list archives

From pig <hadoopn...@gmail.com>
Subject Proper blocksize and io.sort.mb setting when using compressed LZO files
Date Fri, 24 Sep 2010 21:06:08 GMT

We recently switched to LZO-compressed input files for our Hadoop cluster, using Kevin Weil's LZO library.  The files are fairly uniform in size at around 200MB compressed, and our block size is 256MB.  Decompressed, the average LZO input file is around 1.0GB.  I've noticed that many of our jobs are now spilling a lot of data to disk; for example, we have almost 3x more spilled records than map input records.  I'm guessing this is because each mapper gets a 200MB LZO file that decompresses into 1GB of data per mapper.
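
To make sure I'm looking at the right knobs, here is roughly how I understand the map-side spill settings (just a sketch using the 0.20-era property names; the values are our current settings, so please correct me if I've misread how this works):

  import org.apache.hadoop.conf.Configuration;

  public class SpillSettings {
    public static void main(String[] args) {
      Configuration conf = new Configuration();

      // In-memory sort buffer per map task.  With ~1GB of decompressed
      // map output and a 200MB buffer, each map fills and spills the
      // buffer several times, and the merge passes re-read and re-write
      // the spilled records -- which I assume is why spilled records
      // can exceed map input records.
      conf.setInt("io.sort.mb", 200);

      // Fraction of the buffer that triggers a background spill
      // (0.80 is the stock default, I believe).
      conf.set("io.sort.spill.percent", "0.80");

      // How many spill segments get merged at once on the map side.
      conf.setInt("io.sort.factor", 10);

      System.out.println("io.sort.mb = " + conf.get("io.sort.mb"));
    }
  }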

Would you recommend solving this by reducing the block size to 64MB, or even 32MB, and then using the LZO indexer so that a single 200MB LZO file is actually split among 3 or 4 mappers?  Would it be better to play with the io.sort.mb value?  Or would it be best to adjust both?  Right now the io.sort.mb value is the default 200MB.  Have other LZO users had to adjust their block size to compensate for the "expansion" of the data after decompression?
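
If the indexer route makes sense, this is the kind of thing I'm picturing (just a sketch; it assumes the DistributedLzoIndexer class that ships in the hadoop-lzo jar, a made-up /data/input path, and my understanding that a smaller dfs.block.size only applies to files written after the change):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.util.ToolRunner;
  import com.hadoop.compression.lzo.DistributedLzoIndexer;

  public class IndexLzoInput {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();

      // If we also shrink the HDFS block size (e.g. to 64MB), my
      // understanding is it only affects files written afterwards, so
      // the existing 256MB-block files would need to be rewritten.
      conf.setLong("dfs.block.size", 64L * 1024 * 1024);

      // Build .index files next to the .lzo inputs so each 200MB file
      // can be split across several mappers instead of one.  This runs
      // the indexer as a MapReduce job over the given directory.
      int rc = ToolRunner.run(conf, new DistributedLzoIndexer(),
                              new String[] { "/data/input" });
      System.exit(rc);
    }
  }

The jobs themselves would presumably keep using the splittable LZO input format from the same library so the .index files are honored; the open question for me is whether the indexer, a smaller block size, a larger io.sort.mb, or some combination is the right fix.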

Thank you for any help!

