hadoop-common-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject "lost task" suspect Distributed Cache to blame
Date Tue, 02 Mar 2010 20:49:17 GMT
We are happily launching jobs on our Hadoop 0.18.3 cluster at a good
clip now. We have one job in which EVERY map attempt fails with the
same message. When we rerun the job, it usually runs to completion.

stderr logs

Exception in thread "main" java.lang.NullPointerException
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2186)

The second launch runs fine, but it takes 40 minutes for the first
job to fail, with all of the retries.

...
2010-03-02 13:17:32,090 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201002111021_0407_m_000058_0: Task
attempt_201002111021_0407_m_000058_0 failed to report status for 600
seconds. Killing!
2010-03-02 13:17:32,098 INFO org.apache.hadoop.mapred.TaskTracker:
Process Thread Dump: lost task
....
Many dumped threads
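
The 600 seconds is the default mapred.task.timeout (the value is in
milliseconds). As a stopgap while we chase the real cause, we could
raise it for this one job so the cache localization has room to
finish; a minimal sketch (the class name is a placeholder, not our
real code):

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class RaiseTaskTimeout {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(RaiseTaskTimeout.class);
      // Default is 600000 ms (the 600 seconds in the log above).
      conf.setLong("mapred.task.timeout", 30 * 60 * 1000L); // 30 minutes
      // ... set mapper, reducer, input/output paths as usual ...
      JobClient.runJob(conf);
    }
  }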

I noticed some threads are blocked:

Thread 249 (Thread-131):
  State: BLOCKED
  Blocked count: 2
  Waited count: 1
  Blocked on java.util.TreeMap@295acf3
  Blocked by 129 (Thread-65)
  Stack:
    org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:161)
    org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:140)
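
If I am reading the dump right, every TaskRunner in the TaskTracker
JVM goes through a single lock (the TreeMap above) while it localizes
cache files, so one slow copy of a large file can hold up every other
task behind it until they blow the 600-second timeout. A toy sketch of
that pattern, just to illustrate what I think is happening (my own
names, not the actual Hadoop source):

  import java.util.TreeMap;

  public class CacheLockSketch {
    // Stand-in for the shared map the dump shows threads blocked on.
    private static final TreeMap<String, String> cachedArchives =
        new TreeMap<String, String>();

    static String getLocalCache(String uri) throws InterruptedException {
      synchronized (cachedArchives) {      // one lock for the whole JVM
        String local = cachedArchives.get(uri);
        if (local == null) {
          local = copyToLocal(uri);        // a big file means minutes spent
          cachedArchives.put(uri, local);  // while holding the lock
        }
        return local;
      }
    }

    private static String copyToLocal(String uri) throws InterruptedException {
      Thread.sleep(5000);                  // stand-in for a long HDFS copy
      return "/local/cache/" + uri;
    }
  }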

This job is somewhat special (for us) in that it involves shipping
large files over the distributed cache. My working theory is that
something goes wrong with the distributed cache/JobTracker and the
Job/Task/TIPs never have a chance.
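
For context, we ship the files roughly like this (the class name and
path are placeholders), using the org.apache.hadoop.filecache.DistributedCache
API from the stack trace above:

  import java.net.URI;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.mapred.JobConf;

  public class ShipCacheFiles {
    public static JobConf configure() throws Exception {
      JobConf conf = new JobConf(ShipCacheFiles.class);
      // The file must already be in HDFS; each TaskTracker then copies it
      // to local disk (DistributedCache.getLocalCache) before tasks launch.
      DistributedCache.addCacheFile(new URI("/shared/big-lookup-table.dat"), conf);
      return conf;
    }
  }

With files this large, that localization step is where the time goes.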

Has anyone ever experienced something like this?

Thank you,
Edward
