From: Michael Segel
To: user@hadoop.apache.org
Subject: Re: How do map tasks get assigned efficiently?
Date: Wed, 24 Oct 2012 06:51:07 -0500

So...

Data locality only works when you actually have data on the cluster itself. Otherwise, how can the data be local?

Assuming 3X replication, no custom split, and a splittable input file, you will split along the block boundaries. So if your input file has 5 blocks, you will have 5 mappers.

Since there are 3 copies of each block, it's possible for a map task to run on a DN that has a copy of its block.

So it's pretty straightforward, to a point.

When your cluster starts to get a lot of jobs and a slot opens up, your job may not be data local.

With HBase... YMMV

With S3 the data isn't local, so it doesn't matter which Data Node gets the job.

HTH

-Mike

On Oct 24, 2012, at 1:10 AM, David Parks wrote:

> Even after reading O’Reilly’s book on Hadoop I don’t feel like I have a clear vision of how the map tasks get assigned.
>
> They depend on splits, right?
>
> But I have 3 jobs running. And splits will come from various sources: HDFS, S3, and slow HTTP sources.
>
> So I’ve got some concern as to how the map tasks will be distributed to handle the data acquisition.
>
> Can I do anything to ensure that I don’t let the cluster go idle processing slow HTTP downloads when the boxes could simultaneously be doing HTTP downloads for one job and reading large files off HDFS for another job?
>
> I’m imagining a scenario where the only map tasks that are running are all blocking on splits requiring HTTP downloads and the splits coming from HDFS are all queuing up behind it, when they’d run more efficiently in parallel per node.
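
To make the reply's two points concrete (one map task per block, and locality being a preference rather than a guarantee), here is a rough Python sketch. It is not Hadoop code and does not use any Hadoop API. The 64 MB block size, the node names, and the functions `num_splits` and `pick_split` are all assumptions made up for illustration, and the split count ignores details of how Hadoop treats a small trailing remainder.

```python
import math

# Assumed block size for illustration only; real clusters configure this.
BLOCK_SIZE = 64 * 1024 * 1024


def num_splits(file_size, block_size=BLOCK_SIZE):
    """Rough split count for a splittable file with no custom split:
    one split (and so one map task) per block."""
    return math.ceil(file_size / block_size)


def pick_split(free_node, pending, replicas):
    """Toy scheduler step (hypothetical): a slot has opened on free_node.

    pending  -- list of split ids still waiting for a map slot
    replicas -- dict mapping split id -> set of nodes holding a copy

    Returns (split_id, is_data_local). A split whose block lives on
    free_node is preferred; otherwise the first pending split is handed
    out anyway, which is how a task ends up not being data local.
    """
    if not pending:
        return None, False
    for split in pending:
        if free_node in replicas.get(split, set()):
            return split, True
    return pending[0], False


# Five blocks, three replicas each (node names are invented).
hdfs_replicas = {
    0: {"dn1", "dn2", "dn3"},
    1: {"dn2", "dn3", "dn4"},
    2: {"dn1", "dn4", "dn5"},
    3: {"dn3", "dn4", "dn5"},
    4: {"dn1", "dn2", "dn5"},
}

# Splits read from S3 or HTTP have no replica on any node in the cluster.
remote_replicas = {}
```

For example, `num_splits(5 * BLOCK_SIZE)` gives 5 map tasks, and `pick_split("dn4", [0, 1, 2], hdfs_replicas)` returns split 1 as data local. With `remote_replicas` no node ever matches, so whichever node has a free slot simply takes the next split, which is the S3 case described in the reply.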