Date: Wed, 23 Jun 2010 18:08:44 +0800
Subject: Re: Questions about recommendation value of the "io.sort.mb" parameter
From: Jeff Zhang
To: common-dev@hadoop.apache.org

Hi 李钰,

The size of the map output depends on your Mapper class: the Mapper
processes the input data, so its output can be larger or smaller than the
input.

2010/6/23 李钰:
> Hi Sriguru,
>
> Thanks a lot for your comments and suggestions!
> I still have some questions. Since the map phase mainly does data
> preparation (splitting the input into KVPs, then sorting and partitioning
> before the spill), would the map output KVPs be much larger than the
> input data? If not, since one map task handles one input split, and an
> input split is usually 64 MB, the map output KVPs would be approximately
> 64 MB. Could you please give me an example of map output that is much
> larger than the input split? This has confused me for some time, thanks.
>
> Others,
>
> I would also badly need your help if you know about this, thanks.
>
> Best Regards,
> Carp
>
> On June 23, 2010 at 5:11 PM, Srigurunath Chakravarthi wrote:
>
>> Hi Carp,
>> Your assumption is right that this is a per-map-task setting.
>> However, this buffer stores map output KVPs, not input.
>> Therefore the optimal value depends on how much data your map task is
>> generating.
>>
>> If your output per map is greater than io.sort.mb, these rules of thumb
>> could work for you:
>>
>> 1) Increase the max heap of map tasks to use RAM better, but not hit swap.
>> 2) Set io.sort.mb to ~70% of heap.
>>
>> Overall, causing extra "spills" (because of insufficient io.sort.mb) is
>> much better than risking swapping (by setting io.sort.mb and heap too
>> large), in terms of the relative performance penalty you will pay.
>>
>> Cheers,
>> Sriguru
>>
>> >-----Original Message-----
>> >From: 李钰 [mailto:carp84@gmail.com]
>> >Sent: Wednesday, June 23, 2010 12:27 PM
>> >To: common-dev@hadoop.apache.org
>> >Subject: Questions about recommendation value of the "io.sort.mb"
>> >parameter
>> >
>> >Dear all,
>> >
>> >I have a question about the "io.sort.mb" parameter. We can find
>> >material from Yahoo! or Cloudera which recommends setting this value
>> >to 200 if the job scale is large, but I'm confused about this. As far
>> >as I know, the tasktracker launches a child JVM for each task, and
>> >“*io.sort.mb*” sets the in-memory buffer size inside *one map task
>> >child JVM*. The default value of 100 MB should be large enough, because
>> >the input split of one map task is usually 64 MB, the same as the block
>> >size we usually set. Then why is the recommended “*io.sort.mb*” 200 MB
>> >for large jobs (and it really works)? How could the job size affect the
>> >procedure?
>> >Is there any fault in my understanding? Any comment/suggestion will be
>> >highly valued, thanks in advance.
>> >
>> >Best Regards,
>> >Carp
>> >

-- 
Best Regards

Jeff Zhang
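
A rough sketch of the arithmetic discussed in this thread, in case it helps. It is not taken from Hadoop: the function names are hypothetical, the mapper is an imagined word-count style mapper that emits (word, 1) with an assumed 4-byte count, and the spill estimate ignores the spill threshold and per-record metadata, so treat the numbers as illustrative only.

```python
import math


def map_output_bytes(line: str) -> int:
    """Bytes emitted by a hypothetical word-count style mapper for one line.

    Each word is emitted as (word, 1): the UTF-8 bytes of the word plus an
    assumed 4-byte integer count. Serialization overhead is ignored.
    """
    return sum(len(word.encode("utf-8")) + 4 for word in line.split())


def suggested_io_sort_mb(heap_mb: int, fraction: float = 0.7) -> int:
    """Rule of thumb from the thread: io.sort.mb at roughly 70% of task heap."""
    return int(heap_mb * fraction)


def estimated_spills(output_mb: float, io_sort_mb: int) -> int:
    """Rough spill count, assuming the buffer fills completely before each spill."""
    return max(1, math.ceil(output_mb / io_sort_mb))


line = "a b c d e f g h"
print(len(line.encode("utf-8")))   # input bytes: 15
print(map_output_bytes(line))      # output bytes: 8 * (1 + 4) = 40
print(suggested_io_sort_mb(512))   # 358
print(estimated_spills(160, 100))  # 2
print(estimated_spills(160, 200))  # 1
```

With short keys the output is larger than the input, which is the case Jeff describes: a 64 MB split can produce well over 64 MB of map output, and a larger io.sort.mb then reduces the number of spills.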