From: "David Parks" <davidparks21@yahoo.com>
To: user@hadoop.apache.org
Subject: RE: How do map tasks get assigned efficiently?
Date: Thu, 25 Oct 2012 09:49:43 +0700

So the thing that just doesn't click for me yet is this:

On a typical computer, if I try to read two huge files off disk simultaneously it'll just kill the disk performance. This seems like a risk.

What's preventing such disk contention in Hadoop? Is HDFS smart enough to serialize major disk access?

From: Michael Segel [mailto:michael_segel@hotmail.com]
Sent: Wednesday, October 24, 2012 6:51 PM
To: user@hadoop.apache.org
Subject: Re: How do map tasks get assigned efficiently?

So...

Data locality only works when you actually have data on the cluster itself. Otherwise, how can the data be local?

Assuming 3X replication, that you're not doing a custom split, and that your input file is splittable...

You will split along the block boundaries. So if your input file has 5 blocks, you will have 5 mappers.
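Roughly, the sizing works like this (a simplified sketch of the FileInputFormat idea, not the exact Hadoop source; the property names are the old-API ones):

    // Simplified sketch of FileInputFormat-style split sizing (not the
    // exact Hadoop source). minSize/maxSize correspond roughly to
    // mapred.min.split.size / mapred.max.split.size.
    long blockSize = 64L * 1024 * 1024;   // e.g. the default 64MB HDFS block
    long minSize   = 1L;
    long maxSize   = Long.MAX_VALUE;
    long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

    // A splittable 320MB file: ceil(320MB / 64MB) = 5 splits = 5 map tasks.
    long fileLen   = 320L * 1024 * 1024;
    long numSplits = (fileLen + splitSize - 1) / splitSize;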

Since there are 3 copies of each block, it's possible for that map task to run on a DataNode which has a copy of that block.
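You can see this yourself by asking the NameNode where a file's blocks live (a minimal sketch; the path is hypothetical):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockHosts {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/data/input.txt")); // hypothetical
        for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
          // With default replication you'll typically see 3 hosts per block;
          // the scheduler tries to place the map task on one of them.
          System.out.println(b.getOffset() + " -> " + Arrays.toString(b.getHosts()));
        }
      }
    }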

So it's pretty straightforward, to a point.

When your cluster starts to get a lot of jobs and a slot opens up, your job may not be data-local.
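If you're running the Fair Scheduler, it can hold a task briefly, waiting for a data-local slot before settling for a remote one ("delay scheduling"). A sketch of the mapred-site.xml wiring, assuming the MR1 Fair Scheduler; double-check the property names against your version's docs:

    <!-- mapred-site.xml sketch; MR1 Fair Scheduler assumed -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
      <!-- milliseconds to keep waiting for a node-local slot -->
      <name>mapred.fairscheduler.locality.delay</name>
      <value>5000</value>
    </property>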

With HBase... YMMV 

With S3 the data isn't local, so it doesn't matter which DataNode gets the job.

HTH

-Mike

On Oct 24, 2012, at 1:10 AM, David Parks <davidparks21@yahoo.com> wrote:

Even after reading O'Reilly's book on Hadoop, I don't feel like I have a clear vision of how the map tasks get assigned.

They depend on splits, right?

But I have 3 jobs running, and splits will come from various sources: HDFS, S3, and slow HTTP sources.

So I've got some concern about how the map tasks will be distributed to handle the data acquisition.

Can I do anything to ensure that I don't let the cluster go idle processing slow HTTP downloads, when the boxes could simultaneously be doing HTTP downloads for one job and reading large files off HDFS for another?

I'm imagining a scenario where the only map tasks running are all blocked on splits requiring HTTP downloads, while the splits coming from HDFS queue up behind them, when they'd run more efficiently in parallel on each node.
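For example, could I at least submit the jobs asynchronously, so the scheduler sees them together and can interleave their maps? A rough sketch against the old mapred API; the class names and paths are hypothetical, and the real mapper/format setup is omitted:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class SubmitBoth {
      public static void main(String[] args) throws Exception {
        JobConf httpJob = new JobConf(SubmitBoth.class);
        httpJob.setJobName("http-downloads");                           // maps that pull from HTTP
        FileInputFormat.setInputPaths(httpJob, new Path("/urls"));      // hypothetical
        FileOutputFormat.setOutputPath(httpJob, new Path("/out/http")); // hypothetical

        JobConf hdfsJob = new JobConf(SubmitBoth.class);
        hdfsJob.setJobName("hdfs-processing");                          // maps that read big HDFS files
        FileInputFormat.setInputPaths(hdfsJob, new Path("/bigdata"));   // hypothetical
        FileOutputFormat.setOutputPath(hdfsJob, new Path("/out/hdfs")); // hypothetical

        // submitJob() returns immediately (unlike JobClient.runJob()), so both
        // jobs sit with the scheduler together and their map tasks can interleave.
        RunningJob a = new JobClient(httpJob).submitJob(httpJob);
        RunningJob b = new JobClient(hdfsJob).submitJob(hdfsJob);
        a.waitForCompletion();
        b.waitForCompletion();
      }
    }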