Subject: Re: Need help about task slots
From: Rahul Bhattacharjee <rahul.rec.dgp@gmail.com>
To: user@hadoop.apache.org
Date: Sun, 12 May 2013 18:03:09 +0530

Oh! I thought distcp works on complete files rather than a mapper per data block. So I guess the parallelism would still be there if there are multiple files. Please correct me if anything here is wrong.
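For instance (the paths below are purely hypothetical), I would expect copying a whole local directory to still fan out, one map per file:

    bin/hadoop distcp file:///home/rahul/data/ hdfs://localhost:9000/user/rahul/data/

If /home/rahul/data holds, say, 20 files, distcp can schedule up to 20 copy maps (its -m option caps the count), since an individual file is never split across maps. The caveat is that a file:// source has to be readable from whichever nodes run the maps (e.g. an NFS mount); otherwise the copy cannot spread across the cluster anyway.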

Thanks,
Rahul

On Sun, May 12, 2013 at 5:39 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
@Rahul: I'm sorry, I am not aware of any such document. But you could use distcp for a local to HDFS copy:

    bin/hadoop distcp file:///home/tariq/in.txt hdfs://localhost:9000/

And yes, when you use distcp from local to HDFS, you can't take the pleasure of parallelism, as the data is stored in a non-distributed fashion.

Warm Regards,
Tariq


On Sat, May 11, 2013 at 11:07 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
Hello guys,

My 2 cents:

Actually, the no. of mappers is primarily governed by the no. of InputSplits created by the InputFormat you are using, and the no. of reducers by the no. of partitions you get after the map phase. Having said that, you should also keep the no. of slots available per slave in mind, along with the available memory. But as a general rule you could use this approach:

Take the no. of virtual CPUs * 0.75, and that's the no. of slots you can configure. For example, if you have 12 physical cores (i.e. 24 virtual cores), you would have 24 * 0.75 = 18 slots. Now, based on your requirements, you can choose how many mappers and reducers you want. With 18 MR slots you could have 9 mappers and 9 reducers, or 12 mappers and 6 reducers, or whatever split works for you.

I don't know if it makes much sense, but it works pretty decently for me.
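In configuration terms, that 9/9 split would be a per-TaskTracker setting. A minimal sketch, assuming classic MRv1 (pre-YARN) property names, in mapred-site.xml:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>9</value>   <!-- map slots on this node -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>9</value>   <!-- reduce slots on this node -->
    </property>

A TaskTracker then never runs more than 9 map tasks and 9 reduce tasks concurrently, whatever the job asks for.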


Warm Regards,
Tariq


On Sat, May 11, 2013 at 8:57 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
Hi,

I am also new to the Hadoop world; here is my take on your question. If there is something missing, others will surely correct it.

Pre-YARN, the slots are fixed, computed from the crunching capacity of the datanode hardware. Once the slots per datanode are ascertained, they are divided into map and reduce slots; that goes into the config files and remains fixed until changed. In YARN, it is decided at runtime based on the requirements of the particular task. It is very much possible that at a certain point in time one datanode is running 10 tasks while another, similar datanode is running only 4.
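To make that concrete, here is a rough sketch of the YARN side (the values are made up): each NodeManager advertises a resource capacity, each task requests a share of it, and the number of concurrent tasks per node falls out at runtime instead of from fixed slots.

    <!-- yarn-site.xml: what one node offers to containers -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>24576</value>   <!-- 24 GB available for tasks -->
    </property>

    <!-- mapred-site.xml: what each task asks for -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>2048</value>
    </property>
    <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>4096</value>
    </property>

With these numbers one node could run up to 12 maps, or 6 reduces, or some mix, which is why two similar datanodes can be running very different task counts at the same moment.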

Coming to your question: the number of map tasks is decided based on the data set size, the DFS block size, and the InputFormat. Generally, for file-based InputFormats it is one mapper per data block, though there are ways to change this through configuration settings. The number of reduce tasks is set via the job configuration.

The general rule, as I have read in various documents, is that mappers should run for at least a minute, so you can run a sample to find a block (or split) size that makes your mappers run for more than a minute. It again depends on your SLA: if you are not chasing a very tight SLA, you can choose to run fewer mappers at the expense of a higher runtime.
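As a sketch of the knobs involved (the 256 MB and 512 MB figures are only illustrative; the helpers are the split-size setters from the new MapReduce API):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "split-tuning");

    // A split is normally one block; raising the minimum split size
    // coalesces blocks so each mapper gets more data and runs longer.
    FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
    FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024); // 512 MB

    // Reduce tasks come purely from the job configuration, not the data.
    job.setNumReduceTasks(9);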

But again, it is all theory; I am not sure how these things are handled in actual prod clusters.

HTH,



Thanks,
Rahul

On Sat, May 11, 2013 at 8:02 PM, Shashidhar Rao <raoshashidhar123@gmail.com> wrote:
Hi Users,

I am new to Hadoop and confused about task slots in a cluster. How would I know how many task slots are required for a job? Is there an empirical formula, or on what basis should I set the number of task slots?

Thanks in advance



