hadoop-common-dev mailing list archives

From Srigurunath Chakravarthi <srig...@yahoo-inc.com>
Subject RE: Questions about recommendation value of the "io.sort.mb" parameter
Date Wed, 23 Jun 2010 09:11:45 GMT
Hi Carp,
 Your assumption is right that this is a per-map-task setting.
However, this buffer stores map output key-value pairs (KVPs), not input. Therefore the optimal
value depends on how much data your map task is generating.

If your output per map is greater than io.sort.mb, these rules of thumb could work for you:

1) Increase the max heap of map tasks to use RAM better, but without hitting swap.
2) Set io.sort.mb to ~70% of the heap.

Overall, causing extra "spills" (because io.sort.mb is too small) carries a much smaller
relative performance penalty than risking swapping (by setting io.sort.mb and the heap too
large).
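The two rules of thumb above could be expressed as a mapred-site.xml fragment. The specific numbers here are illustrative assumptions, not values from this thread: a 1 GB map-task heap, with io.sort.mb set to roughly 70% of it.

```xml
<!-- Hypothetical example values; tune to your own cluster's RAM budget. -->
<property>
  <name>mapred.child.java.opts</name>
  <!-- Rule 1: raise the map-task heap as far as RAM allows without swapping. -->
  <value>-Xmx1024m</value>
</property>
<property>
  <name>io.sort.mb</name>
  <!-- Rule 2: ~70% of the 1024 MB heap. -->
  <value>700</value>
</property>
```

The same settings can be passed per job on the command line via -D, e.g. -Dio.sort.mb=700.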


>-----Original Message-----
>From: 李钰 [mailto:carp84@gmail.com]
>Sent: Wednesday, June 23, 2010 12:27 PM
>To: common-dev@hadoop.apache.org
>Subject: Questions about recommendation value of the "io.sort.mb" parameter
>Dear all,
>Here I've got a question about the "io.sort.mb" parameter. We can find
>material from Yahoo! or Cloudera which recommends setting this value to 200MB
>if the job scale is large, but I'm confused about this. As I know,
>the tasktracker will launch a child-JVM for each task, and since *io.sort.mb*
>represents the buffer size in memory inside *one map task child-JVM*, the
>default value of 100MB should be large enough, because the input split of a
>map task is usually 64MB, as large as the block size we usually set. So
>why is the recommendation for *io.sort.mb* 200MB for large jobs (and why
>does it really work)? How could the job size affect the procedure?
>Is there any fault in my understanding? Any comment/suggestion will be
>highly valued; thanks in advance.
>Best Regards,