Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of msegel_hadoop@hotmail.com
 designates 65.55.111.94 as permitted sender)
Message-ID: <BLU404-EAS2373209B559201D3F23A3DD98090@phx.gbl>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Subject: Re: question about preserving data locality in MapReduce with Yarn
References: 
 <CADoVA9qcqzhofucT1dt+NCshOMoVYxG600rvt_Zr+_v_V9Qjyg@mail.gmail.com>
From: Michael Segel <msegel_hadoop@hotmail.com>
In-Reply-To: 
 <CADoVA9qcqzhofucT1dt+NCshOMoVYxG600rvt_Zr+_v_V9Qjyg@mail.gmail.com>
Date: Mon, 28 Oct 2013 21:03:58 -0500
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
MIME-Version: 1.0 (1.0)

How do you know where the data exists when you begin?

Sent from a remote device. Please excuse any typos...

Mike Segel

> On Oct 28, 2013, at 8:57 PM, "ricky lee" <rickylee0815@gmail.com> wrote:
>
> Hi,
>
> I have a question about maintaining data locality in a MapReduce job
> launched through Yarn. Based on the Yarn tutorial, it seems that an
> application master can specify a resource name, memory, and CPU when
> requesting containers. By carefully choosing resource names, I think data
> locality can be achieved. I am curious how the current MapReduce application
> master does this. Does it check all the blocks needed for a job and choose a
> subset of nodes with the most needed blocks? If someone can point me to the
> source code snippets that make this decision, it would be very much
> appreciated. Thanks.
>
> -r