hadoop-common-dev mailing list archives

From 李钰 <car...@gmail.com>
Subject Re: Questions about recommendation value of the "io.sort.mb" parameter
Date Wed, 23 Jun 2010 10:33:02 GMT
Hi Jeff,

Thanks for your quick reply. It seems my thinking was stuck on the style of
job I'm running. It is much clearer to me now.

Best Regards,
Carp

2010/6/23 Jeff Zhang <zjffdu@gmail.com>

> Hi 李钰
>
> The size of map output depends on your Mapper class. The Mapper class
> will do processing on the input data.
>
>
>
> 2010/6/23 李钰 <carp84@gmail.com>:
> > Hi Sriguru,
> >
> > Thanks a lot for your comments and suggestions!
> > Here I still have some questions: since the map phase mainly does data
> > preparation, that is, splitting the input data into KVPs, then sorting and
> > partitioning them before the spill, would the size of the map output KVPs
> > be much larger than the input data size? If not, since one map task deals
> > with one input split, and one input split is usually 64 MB, the map output
> > KVPs would be approximately 64 MB. Could you please give me an example of
> > map output that is much larger than the input split? This has confused me
> > for some time, thanks.
> >
> > To everyone else,
> >
> > I would also greatly appreciate your help if you know about this, thanks.
> >
> > Best Regards,
> > Carp
> >
> > On June 23, 2010 at 5:11 PM, Srigurunath Chakravarthi <sriguru@yahoo-inc.com> wrote:
> >
> >> Hi Carp,
> >>  Your assumption is right that this is a per-map-task setting.
> >> However, this buffer stores map output KVPs, not input. Therefore the
> >> optimal value depends on how much data your map task is generating.
> >>
> >> If your output per map is greater than io.sort.mb, these rules of thumb
> >> could work for you:
> >>
> >> 1) Increase the max heap of map tasks to make better use of RAM, without
> >>    hitting swap.
> >> 2) Set io.sort.mb to ~70% of heap.
> >>
> >> Overall, causing extra "spills" (because of insufficient io.sort.mb) is
> >> much better than risking swapping (by setting io.sort.mb and heap too
> >> large), in terms of the relative performance penalty you will pay.
> >>
> >> Cheers,
> >> Sriguru
> >>
> >> >-----Original Message-----
> >> >From: 李钰 [mailto:carp84@gmail.com]
> >> >Sent: Wednesday, June 23, 2010 12:27 PM
> >> >To: common-dev@hadoop.apache.org
> >> >Subject: Questions about recommendation value of the "io.sort.mb" parameter
> >> >
> >> >Dear all,
> >> >
> >> >Here I've got a question about the "io.sort.mb" parameter. We can find
> >> >material from Yahoo! or Cloudera that recommends setting this value to
> >> >200 if the job scale is large, but I'm confused about this. As far as I
> >> >know, the tasktracker launches a child JVM for each task, and
> >> >"io.sort.mb" represents the in-memory buffer size inside *one map task
> >> >child JVM*. The default value of 100 MB should be large enough, because
> >> >the input split of one map task is usually 64 MB, the same as the block
> >> >size we usually set. Why, then, is the recommended "io.sort.mb" 200 MB
> >> >for large jobs (and it really works)? How could the job size affect the
> >> >procedure?
> >> >Is there any fault in my understanding? Any comments or suggestions will
> >> >be highly valued. Thanks in advance.
> >> >
> >> >Best Regards,
> >> >Carp
> >>
> >
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
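
Sriguru's rules of thumb in the thread above can be sketched numerically. This is only a rough model and not Hadoop's actual buffer accounting: it assumes a spill is triggered each time the buffer fills to io.sort.spill.percent (0.80 by default in Hadoop of this era) and it ignores the split between record metadata and record data inside the buffer. The function names and the example sizes are hypothetical.

```python
import math

def suggested_io_sort_mb(map_heap_mb, percent_of_heap=70):
    """Rule of thumb from the thread: io.sort.mb at roughly 70% of map task heap."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return map_heap_mb * percent_of_heap // 100

def estimated_spills(map_output_mb, io_sort_mb, spill_percent=0.80):
    """Rough spill count, assuming one spill per filled threshold (simplified model)."""
    if map_output_mb <= 0:
        return 0
    usable_mb = io_sort_mb * spill_percent  # buffer space used before a spill starts
    # Non-empty output is flushed to disk at least once when the task finishes.
    return max(1, math.ceil(map_output_mb / usable_mb))

if __name__ == "__main__":
    # Hypothetical map task that emits 200 MB of output from a 64 MB split.
    for sort_mb in (100, 200):
        print(sort_mb, estimated_spills(200, sort_mb))
    print(suggested_io_sort_mb(300))
```

Under this model, what determines the spill count is the map output size relative to io.sort.mb, not the 64 MB input split, which is the distinction Sriguru draws.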

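
Jeff's point that map output size depends on the Mapper can be illustrated with a small simulation. The mapper below is hypothetical and not taken from the thread: it emits one record for every ordered pair of words on a line (a word co-occurrence job), so its output grows quadratically with line length. It is plain Python that imitates a map function, not code that uses the Hadoop API, and the serialized record layout is an assumption made only to count bytes.

```python
from itertools import permutations

def cooccurrence_mapper(line):
    """Emit ((w1, w2), 1) for every ordered pair of word positions in the line."""
    words = line.split()
    for w1, w2 in permutations(words, 2):
        yield (w1, w2), 1

def expansion_ratio(lines):
    """Approximate output bytes divided by input bytes for the mapper above."""
    # Count each input line plus its newline.
    input_bytes = sum(len(line.encode("utf-8")) + 1 for line in lines)
    output_bytes = 0
    for line in lines:
        for (w1, w2), count in cooccurrence_mapper(line):
            # Assumed text serialization of one output record: "w1 w2<TAB>count<NEWLINE>"
            output_bytes += len(f"{w1} {w2}\t{count}\n".encode("utf-8"))
    return output_bytes / input_bytes

if __name__ == "__main__":
    sample = ["the quick brown fox jumps over the lazy dog"]
    print(expansion_ratio(sample))
```

With a mapper of this kind, a 64 MB input split can produce map output several times larger than the split, which is the case where a larger io.sort.mb helps.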