Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of michael_segel@hotmail.com
 designates 65.55.111.81 as permitted sender)
Message-ID: <BLU0-SMTP34373650756DAD1807408868F780@phx.gbl>
From: Michael Segel <michael_segel@hotmail.com>
Content-Type: multipart/alternative;
	boundary="Apple-Mail=_661AD2D0-A6A0-496B-99F4-F680DBC83C38"
MIME-Version: 1.0 (Mac OS X Mail 6.2 \(1499\))
Subject: Re: How do map tasks get assigned efficiently?
Date: Wed, 24 Oct 2012 06:51:07 -0500
References: <038501cdb1ae$476f1320$d64d3960$@yahoo.com>
To: user@hadoop.apache.org
In-Reply-To: <038501cdb1ae$476f1320$d64d3960$@yahoo.com>

--Apple-Mail=_661AD2D0-A6A0-496B-99F4-F680DBC83C38
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="windows-1252"

So...

Data locality only works when you actually have data on the cluster
itself. Otherwise, how can the data be local?

Assuming 3X replication, no custom split, and a splittable input
file...

You will split along block boundaries. So if your input file has 5
blocks, you will have 5 mappers.
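
As a rough sketch of that arithmetic (not Hadoop's actual code): the
function name and the 64 MB block size below are assumptions for
illustration. The real FileInputFormat also applies min/max split size
settings and a small slop factor on the last split, which this ignores.

```python
def count_map_tasks(file_size_bytes, block_size_bytes):
    """Approximate number of splits (one map task each) for a single
    splittable file, assuming one split per HDFS block."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division: a partial last block still gets a split.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


BLOCK = 64 * 1024 * 1024  # assumed block size, 64 MB

print(count_map_tasks(5 * BLOCK, BLOCK))      # file of exactly 5 blocks
print(count_map_tasks(4 * BLOCK + 1, BLOCK))  # 4 full blocks + 1 byte
```

Both calls give 5: a file that spills one byte into a fifth block still
needs a fifth mapper.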

Since there are 3 copies of each block, it's possible for each map
task to run on a DN that has a copy of its block.

So it's pretty straightforward, up to a point.

When your cluster starts to get a lot of jobs and a slot opens up, your
task may not be data local, since the free slot may be on a node that
holds no replica of the block.
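
A toy sketch of that behaviour, to make the idea concrete. This is not
the JobTracker's real logic: pick_task, the node names, and the task
ids are all hypothetical, and it ignores rack locality and competing
jobs. When a slot frees up on a node, it prefers a pending task whose
block has a replica there, and otherwise hands out a non-local task.

```python
def pick_task(free_node, pending):
    """Choose a task for a slot that just opened on free_node.

    pending maps task id -> set of nodes holding a replica of that
    task's input block. Returns (task_id, is_data_local), or None if
    nothing is pending.
    """
    # First choice: a task whose block is stored on the free node.
    for task_id, replica_nodes in pending.items():
        if free_node in replica_nodes:
            return task_id, True
    # Fallback: any pending task; its block is read over the network.
    for task_id in pending:
        return task_id, False
    return None


pending = {
    "m0": {"dn1", "dn2", "dn3"},
    "m1": {"dn2", "dn4", "dn5"},
}
print(pick_task("dn4", pending))  # slot on a node holding m1's block
print(pick_task("dn6", pending))  # slot on a node holding no replicas
```

The first call returns a data-local assignment for m1; the second can
only return a non-local one, which is the busy-cluster case described
above.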

With HBase... YMMV
With S3 the data isn't local, so it doesn't matter which node gets the
task.

HTH

-Mike

On Oct 24, 2012, at 1:10 AM, David Parks <davidparks21@yahoo.com> wrote:

> Even after reading O'Reilly's book on Hadoop I don't feel like I
> have a clear vision of how the map tasks get assigned.
>
> They depend on splits, right?
>
> But I have 3 jobs running. And splits will come from various sources:
> HDFS, S3, and slow HTTP sources.
>
> So I've got some concern as to how the map tasks will be distributed
> to handle the data acquisition.
>
> Can I do anything to ensure that I don't let the cluster go idle
> processing slow HTTP downloads when the boxes could simultaneously be
> doing HTTP downloads for one job and reading large files off HDFS for
> another job?
>
> I'm imagining a scenario where the only map tasks that are running
> are all blocking on splits requiring HTTP downloads and the splits
> coming from HDFS are all queuing up behind them, when they'd run more
> efficiently in parallel per node.



--Apple-Mail=_661AD2D0-A6A0-496B-99F4-F680DBC83C38--