From: Michael Segel
Date: Mon, 28 Oct 2013 21:03:58 -0500
To: "user@hadoop.apache.org"
Subject: Re: question about preserving data locality in MapReduce with Yarn

How do you know where the data exists when you begin?

Sent from a remote device. Please excuse any typos...

Mike Segel

> On Oct 28, 2013, at 8:57 PM, "ricky lee" wrote:
>
> Hi,
>
> I have a question about maintaining data locality in a MapReduce job launched through Yarn. Based on the Yarn tutorial, it seems like an application master can specify resource name, memory, and cpu when requesting containers. By carefully choosing resource names, I think the data locality can be achieved. I am curious how the current MapReduce application master is doing this. Does it check all needed blocks for a job and choose a subset of nodes with the most needed blocks? If someone can point me to source code snippets that make this decision, it would be very much appreciated. thx.
>
> -r