We have a Hadoop 1.2.1 cluster running on EC2. The JobTracker is an on-demand instance, while the TaskTrackers are spot instances, all launched from the same AMI. There are normally about 100 instances running, with roughly 1500 map and reduce slots. We run hundreds of Hive queries every day; there are no custom MapReduce jobs.
In the past few days, queries have been running slower than before. On investigation, we found that tasks are not starting even though free map and reduce slots are available. CPU usage on the TaskTrackers is around 50%, the load average is below the number of cores, and memory utilization is well under the available maximum. Tasks get stuck in the UNASSIGNED state for several minutes before they launch, or they fail to launch and then start after a timeout of a few minutes.
The JobTracker is running at around 20% CPU, with 1 GB of its 4 GB heap (-Xmx) in use.
This causes jobs to take longer to start, and therefore longer to finish.
Is there something we can do to debug and fix this issue?