On 3 September 2012 15:19, Abhay Ratnaparkhi <abhay.ratnaparkhi@gmail.com> wrote:
Hello,

How can one find out which nodes the reduce tasks will run on?

One of my jobs is running and is completing all of its map tasks.
The map tasks write a lot of intermediate data, and the intermediate directory is filling up on all the nodes.
If a reduce task is scheduled on any node in the cluster, it will try to copy the intermediate data to the same disk and will eventually fail with disk-space-related exceptions.


You could always set up dedicated partitions for intermediate data, though you get better bandwidth by striping the data across all disks, and more flexibility by sharing the same partition.
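
For example (a minimal sketch, assuming a Hadoop 1.x setup like the one in this thread; the mount points are placeholders for your own disks), mapred.local.dir in mapred-site.xml takes a comma-separated list of directories, and the TaskTrackers spread intermediate map output across them:

  <property>
    <name>mapred.local.dir</name>
    <!-- one directory per physical disk; intermediate data is striped across them -->
    <value>/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local</value>
  </property>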

There's also a property that controls how much space is left to DFS storage: increase dfs.datanode.du.reserved and the datanodes will leave more free space on each volume for non-DFS use, such as intermediate map output.
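
For example, in hdfs-site.xml (the 10 GB value is only an illustration; size it to your own intermediate data volume):

  <property>
    <name>dfs.datanode.du.reserved</name>
    <!-- bytes per volume kept free for non-DFS use, e.g. intermediate map output -->
    <value>10737418240</value>
  </property>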

see: http://wiki.apache.org/hadoop/DiskSetup