hadoop-mapreduce-user mailing list archives

From Amareshwari Sri Ramadasu <amar...@yahoo-inc.com>
Subject Re: Reducers are stuck fetching map data.
Date Wed, 20 Jan 2010 14:22:22 GMT
ReadTimeOuts are known to be costly during the shuffle when the map runtime is high.
Please see HADOOP-3327 (http://issues.apache.org/jira/browse/HADOOP-3327) for the shuffle
improvements made specifically for ReadTimeOut.

Thanks
Amareshwari

On 1/20/10 6:07 PM, "Suhail Rehman" <suhailrehman@gmail.com> wrote:

We are having trouble running Hadoop MapReduce jobs on our cluster.

The cluster consists of VMs running on an IBM blade center with the following virtualized configuration:

Master Node/NameNode: 1x
OS: Xen RedHat Linux 5.2, CPU: 3 vCPU, RAM: 1024 MB
Slaves/DataNode: 3x
OS: Xen RedHat Linux 5.2, CPU: 1 vCPU, RAM: 1024 MB

We are working with the standard Hadoop example code. We are using Hadoop 0.20.1 (stable) with
the latest patches installed. All VMs have firewalls turned off and SELinux disabled.

For example, when we try to execute the "wordcount" program on a provisioned cluster, the
Map operations complete successfully, but the program gets stuck trying to complete the reduce operations.

On examining the logs, we find that the Reducers are waiting for the outputs from Map operations
on other nodes. Our understanding is that this communication happens over HTTP sockets and
all these provisioned VMs have trouble communicating over the HTTP sockets on the ports that
Hadoop uses.
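
One way to check this understanding is to time a plain TCP connection from each node to the
TaskTracker HTTP port on the others. The sketch below is only an illustration: it assumes the
Hadoop 0.20 default TaskTracker HTTP port of 50060 (mapred.task.tracker.http.address), and the
host names slave1, slave2 and slave3 are placeholders, not the real names of these VMs.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection resolves the host name and connects in one step.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return None


if __name__ == "__main__":
    # Placeholder host names (assumption); replace with the cluster's own.
    hosts = ["slave1", "slave2", "slave3"]
    # 50060 is assumed to be the TaskTracker HTTP port (Hadoop 0.20 default).
    for host in hosts:
        elapsed = probe(host, 50060)
        if elapsed is None:
            print(f"{host}:50060 unreachable")
        else:
            print(f"{host}:50060 connected in {elapsed:.3f} s")
```

Running this from every node toward every other node would show whether connection setup
itself is slow or failing, as opposed to reads timing out on an established connection.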

Also, when we try to access the JobTracker web interface to view the running jobs, the
machine takes too long to respond to our queries. Since both the Reducer
communication and the JobTracker web interface work over HTTP, we think the problem might
be a networking issue or a problem with the built-in HTTP service in Hadoop (Jetty).

Attached is a partial Task log from one of the Reducers. The message
"WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: Read timed out"
appears on all reducers, and eventually the Job either fails to complete or takes a very long
time (about 15 hours to process an 11 GB text file).
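
To compare reducers, it may help to count how often this warning occurs in each task log. The
following is a minimal sketch that matches only the message text quoted above; the function
name and the log path in the usage comment are hypothetical.

```python
# The exception text exactly as it appears in the warning quoted above.
TIMEOUT_MARKER = "java.net.SocketTimeoutException: Read timed out"


def count_read_timeouts(lines):
    """Count the lines that contain the shuffle read-timeout warning."""
    return sum(1 for line in lines if TIMEOUT_MARKER in line)


# Usage (the path is a placeholder, not a real location on this cluster):
# with open("/path/to/reducer/syslog") as log:
#     print(count_read_timeouts(log))
```

A reducer whose count is much higher than the others would point at one slow or unreachable
node rather than a cluster-wide problem.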

This problem seems to be random: at times the program runs successfully in about 20 minutes,
and at other times it takes 15 hours to complete the same operation.

Any help with this would be much appreciated.

Regards,

Suhail Rehman
MS by Research in Computer Science
International Institute of Information Technology - Hyderabad
rehman@research.iiit.ac.in
---------------------------------------------------------------------
http://research.iiit.ac.in/~rehman

