hadoop-common-user mailing list archives

From Amareshwari Sriramadasu <amar...@yahoo-inc.com>
Subject Re: Tasktrackers job cache directories not always cleaned up
Date Wed, 02 Jul 2008 06:59:16 GMT
If a task tracker doesn't receive a KillJobAction, it's true that the job 
directory will not be removed.
And your observation is correct: some task trackers didn't receive a 
KillJobAction for the job.
If a reduce task finishes before the job completes, a KillTaskAction is 
sent for that task.
It looks like there is a bug in sending KillJobAction to the task tracker.
Could you please file a JIRA issue for this?

Thanks
Amareshwari
> The task subdirectories are being deleted, but the job directory and
> its work subdirectory are not. This is causing a problem since disk
> space is filling up over time, and restarting the cluster after a long
> time is very slow as the tasktrackers clear out the jobcache
> directories.
>
> This doesn't happen for every task run by a tasktracker, but it is
> happening to a significant number.
>
> I think it has something to do with the KillJobAction not being called
> because if I grep the log for lines from the relevant job containing
> "Kill" I see this:
>
> 2008-07-01 10:15:04,046 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0214:localhost/127.0.0.1:41484 -> KillTaskAction:
> task_200806300936_0279_r_000000_0
> 2008-07-01 10:15:16,223 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0214:localhost/127.0.0.1:41484 -> KillTaskAction:
> task_200806300936_0279_r_000001_0
> 2008-07-01 10:15:31,556 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0214:localhost/127.0.0.1:41484 -> KillTaskAction:
> task_200806300936_0279_r_000003_0
> 2008-07-01 10:15:39,882 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0207:localhost/127.0.0.1:37241 -> KillTaskAction:
> task_200806300936_0279_r_000002_0
> 2008-07-01 10:15:41,863 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0214:localhost/127.0.0.1:41484 -> KillTaskAction:
> task_200806300936_0279_r_000004_0
> 2008-07-01 10:15:51,484 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0207:localhost/127.0.0.1:37241 -> KillTaskAction:
> task_200806300936_0279_r_000006_0
> 2008-07-01 10:15:51,939 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0214:localhost/127.0.0.1:41484 -> KillTaskAction:
> task_200806300936_0279_r_000007_0
> 2008-07-01 10:15:59,695 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0207:localhost/127.0.0.1:37241 -> KillTaskAction:
> task_200806300936_0279_r_000008_0
> 2008-07-01 10:16:45,620 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0202:localhost/127.0.0.1:47183 -> KillTaskAction:
> task_200806300936_0279_r_000005_0
> 2008-07-01 10:16:47,328 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0216:localhost/127.0.0.1:37282 -> KillJobAction:
> job_200806300936_0279
> 2008-07-01 10:16:47,334 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f020c:localhost/127.0.0.1:52033 -> KillJobAction:
> job_200806300936_0279
> 2008-07-01 10:16:47,453 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0210:localhost/127.0.0.1:35235 -> KillJobAction:
> job_200806300936_0279
> 2008-07-01 10:16:47,768 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f020d:localhost/127.0.0.1:41562 -> KillJobAction:
> job_200806300936_0279
> 2008-07-01 10:16:48,652 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0203:localhost/127.0.0.1:65277 -> KillJobAction:
> job_200806300936_0279
> 2008-07-01 10:16:49,005 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0205:localhost/127.0.0.1:48747 -> KillJobAction:
> job_200806300936_0279
> 2008-07-01 10:16:49,365 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0209:localhost/127.0.0.1:59538 -> KillJobAction:
> job_200806300936_0279
> 2008-07-01 10:16:49,563 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0214:localhost/127.0.0.1:41484 -> KillJobAction:
> job_200806300936_0279
> 2008-07-01 10:16:49,747 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f020a:localhost/127.0.0.1:40410 -> KillJobAction:
> job_200806300936_0279
> 2008-07-01 10:16:50,321 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0212:localhost/127.0.0.1:33514 -> KillJobAction:
> job_200806300936_0279
> 2008-07-01 10:16:50,352 DEBUG org.apache.hadoop.mapred.JobTracker:
> tracker_m0f0207:localhost/127.0.0.1:37241 -> KillJobAction:
> job_200806300936_0279
>
> Notice that tracker_m0f0202 receives a KillTaskAction (which removes
> the task working directory), but not a KillJobAction (which would
> remove the job directory). All the other trackers received
> KillJobAction. I'm not sure what's happening here to cause this.
>
> This is on 0.16.4.
>
> Anyone else seen this?
>
> Tom
>   
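The check described in the quoted message (find trackers that were sent a KillTaskAction for a job but never a KillJobAction) can be sketched in a few lines of Python. This is only an illustration, not part of Hadoop: the function names are made up, and it assumes each JobTracker DEBUG entry sits on a single line in the real log, in the form "<tracker> -> <action>: <task or job id>" as it appears in the excerpt (the mail client wrapped each entry over three lines).

```python
import re
from collections import defaultdict

# Assumption: one log entry per line, in the shape quoted in the message, e.g.
#   ... JobTracker: tracker_m0f0202:localhost/127.0.0.1:47183 -> KillTaskAction: task_200806300936_0279_r_000005_0
ACTION_RE = re.compile(
    r"(?P<tracker>tracker_\S+) -> (?P<action>Kill(?:Task|Job)Action): "
    r"(?P<target>(?:task|job)_\S+)"
)


def job_id_of(target):
    """Derive the job id from a task or job id (hypothetical helper).

    task_200806300936_0279_r_000005_0 and job_200806300936_0279 both
    map to job_200806300936_0279.
    """
    parts = target.split("_")
    return "job_" + "_".join(parts[1:3])


def trackers_missing_kill_job(lines, job_id):
    """Return trackers that were sent some kill action for job_id
    but never a KillJobAction (hypothetical helper)."""
    seen = defaultdict(set)  # tracker name -> set of action names seen
    for line in lines:
        m = ACTION_RE.search(line)
        if m and job_id_of(m.group("target")) == job_id:
            seen[m.group("tracker")].add(m.group("action"))
    return sorted(t for t, actions in seen.items()
                  if "KillJobAction" not in actions)
```

Calling trackers_missing_kill_job(open(path), "job_200806300936_0279") on a JobTracker log (path being wherever that log lives) lists the affected trackers. Note that a tracker which was sent neither action for the job does not appear in the log lines at all, so this sketch cannot detect it.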

