hadoop-common-user mailing list archives

From Srigurunath Chakravarthi <srig...@yahoo-inc.com>
Subject RE: Proper blocksize and io.sort.mb setting when using compressed LZO files
Date Sun, 26 Sep 2010 07:05:40 GMT
 Tuning io.sort.mb will certainly be worthwhile if you have enough RAM to allow for a higher
Java heap per map task without risking swapping.

 Similarly, you can decrease spills on the reduce side using fs.inmemory.size.mb.

You can use the following rules of thumb for tuning those two:

- Set these to ~70% of Java heap size. Pick heap sizes to utilize ~80% of RAM across all processes
(maps, reducers, TT, DN, other).
- Set them small enough to avoid swap activity, but
- Set them large enough to minimize disk spills.
- Ensure that io.sort.factor is set large enough to allow full use of buffer space.
- Balance space for output records (default 95%) and record meta-data (5%). Use io.sort.spill.percent
and io.sort.record.percent.
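
As a rough illustration of the first rule of thumb, the arithmetic could be sketched as below. The node RAM, slot counts, and daemon heap figure are made-up assumptions, not values from this thread, and suggest_sort_buffers is a hypothetical helper; only the ~80% and ~70% factors come from the rules above.

```python
def suggest_sort_buffers(node_ram_mb, map_slots, reduce_slots,
                         daemon_heap_mb=2000, ram_fraction=0.8,
                         buffer_fraction=0.7):
    """Return (task_heap_mb, buffer_mb) following the ~80% / ~70% rules of thumb."""
    # Budget ~80% of physical RAM for all JVMs on the node.
    budget_mb = node_ram_mb * ram_fraction
    # Reserve heap for TT, DN and other daemons (assumed figure).
    task_budget_mb = budget_mb - daemon_heap_mb
    if task_budget_mb <= 0:
        raise ValueError("daemons alone exceed the RAM budget")
    # Split what is left evenly across all map and reduce slots.
    task_heap_mb = int(task_budget_mb // (map_slots + reduce_slots))
    # Candidate sort buffer size: ~70% of the per-task heap.
    buffer_mb = int(task_heap_mb * buffer_fraction)
    return task_heap_mb, buffer_mb


# Hypothetical node: 16 GB RAM, 6 map slots, 2 reduce slots.
heap_mb, buffer_mb = suggest_sort_buffers(16384, 6, 2)
print(f"per-task heap: {heap_mb} MB, sort buffer: {buffer_mb} MB")
```

Treat the result as a starting point only; whether the node actually stays out of swap has to be checked under real load.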

 Your mileage may vary. We've seen job execution time improvements of 1-3% from spill avoidance
across miscellaneous applications.

 Your other option of running a map per 32MB or 64MB of input should give you better performance
if your map task execution time is significant (i.e., much larger than a few seconds) compared
to the overhead of launching map tasks and reading input.
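
To get a feel for the smaller-split option, a back-of-the-envelope sketch like the one below may help. It assumes the LZO file is indexed (so it can be split), that splits hold roughly split_mb of compressed data, and that the compression ratio is uniform across the file; estimate_map_inputs is a hypothetical helper, and 1024 MB stands in for the "around 1.0GB" figure from the original question.

```python
import math


def estimate_map_inputs(compressed_mb, decompressed_mb, split_mb):
    """Approximate map count and decompressed input per map for one file."""
    # One map per split of compressed data (rounded up).
    num_maps = math.ceil(compressed_mb / split_mb)
    # Assume the decompressed data is spread evenly across those maps.
    per_map_mb = decompressed_mb / num_maps
    return num_maps, per_map_mb


# 200 MB compressed, ~1 GB decompressed, at three candidate split sizes.
for split_mb in (256, 64, 32):
    maps, per_map = estimate_map_inputs(200, 1024, split_mb)
    print(f"split {split_mb} MB: {maps} map(s), ~{per_map:.0f} MB decompressed each")
```

Real split boundaries follow the LZO index, so actual counts and sizes will differ somewhat from this estimate.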


>-----Original Message-----
>From: pig [mailto:hadoopnode@gmail.com]
>Sent: Saturday, September 25, 2010 2:36 AM
>To: common-user@hadoop.apache.org
>Subject: Proper blocksize and io.sort.mb setting when using compressed
>LZO files
>We just recently switched to using lzo compressed file input for our
>cluster using Kevin Weil's lzo library.  The files are pretty uniform
>size at around 200MB compressed.  Our block size is 256MB.
>Decompressed, the average LZO input file is around 1.0GB.  I noticed
>lots of our jobs are spilling lots of data to disk.  We have almost 3x
>more spilled records than map input records, for example.  I'm guessing
>this is because each mapper is getting a 200 MB lzo file which
>decompresses into 1GB of data per
>Would you recommend solving this by reducing the block size to 64MB, or
>32MB and then using the LZO indexer so that a single 200MB lzo file is
>actually split among 3 or 4 mappers?  Would it be better to play with
>io.sort.mb value?  Or, would it be best to play with both? Right now
>io.sort.mb value is the default 200MB. Have other lzo users had to
>their block size to compensate for the "expansion" of the data after
>Thank you for any help!
