Subject: Re: knowing the nodes on which reduce tasks will run
From: Steve Loughran
To: user@hadoop.apache.org
Date: Tue, 4 Sep 2012 12:33:31 +0100

On 3 September 2012 15:19, Abhay Ratnaparkhi wrote:

> Hello,
>
> How can one get to know the nodes on which reduce tasks will run?
>
> One of my jobs is running and it is completing all its map tasks.
> My map tasks write lots of intermediate data, and the intermediate
> directory is filling up on all the nodes. If a reduce task is scheduled
> on any node in the cluster, it will try to copy the map output to the
> same disk and will eventually fail with disk-space-related exceptions.

You could always set up dedicated partitions for intermediate data, though
you get better bandwidth by striping the data across all disks, and better
flexibility by sharing the same partition.

There's also a property to set how much space per volume the datanodes
leave free for non-DFS use: increase dfs.datanode.du.reserved and the
datanodes will claim less space for DFS block storage, leaving more room
free for intermediate data.
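As a rough sketch, assuming a Hadoop 1.x cluster (the /disk* mount points
and the 10 GB figure below are made-up examples, not recommendations), the
two settings look like this:

  <!-- mapred-site.xml: stripe intermediate map output across several
       local disks; ideally each entry is a separate physical spindle -->
  <property>
    <name>mapred.local.dir</name>
    <value>/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local</value>
  </property>

  <!-- hdfs-site.xml: bytes per volume the datanode leaves free for
       non-DFS use (here 10 GB); raising this shrinks DFS's share -->
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>10737418240</value>
  </property>

The changed files need to be pushed out to every node, and the daemons
restarted, before the new values take effect.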
see: http://wiki.apache.org/hadoop/DiskSetup