hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Amareshwari Sriramadasu <amar...@yahoo-inc.com>
Subject Re: Tasktrackers job cache directories not always cleaned up
Date Wed, 09 Jul 2008 07:49:04 GMT
The proposal on http://issues.apache.org/jira/browse/HADOOP-3386 takes 
care of this.

Thanks
Amareshwari
Amareshwari Sriramadasu wrote:
> If task tracker didn't receive KillJobAction, its true that job 
> directory will not removed.
> And your observation is correct that some task trackers didn't receive 
> KillJobAction for the job.
> If a reduce task has finished before the job completion, the task will 
> be sent KillTaskAction.
> Looks like there is a bug in sending KillJobAction to the task tracker.
> Could you please file jira for this?
>
> Thanks
> Amareshwari
>> The task subdirectories are being deleted, but the job directory and
>> its work subdirectory are not. This is causing a problem since disk
>> space is filling up over time, and restarting the cluster after a long
>> time is very slow as the tasktrackers clear out the jobcache
>> directories.
>>
>> This doesn't happen for every task run by a tasktracker, but it is
>> happening to a significant number.
>>
>> I think it has something to do with the KillJobAction not being called
>> because if I grep the log for lines from the relevant job containing
>> "Kill" I see this:
>>
>> 2008-07-01 10:15:04,046 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0214:localhost/127.0.0.1:41484 -> KillTaskAction:
>> task_200806300936_0279_r_000000_0
>> 2008-07-01 10:15:16,223 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0214:localhost/127.0.0.1:41484 -> KillTaskAction:
>> task_200806300936_0279_r_000001_0
>> 2008-07-01 10:15:31,556 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0214:localhost/127.0.0.1:41484 -> KillTaskAction:
>> task_200806300936_0279_r_000003_0
>> 2008-07-01 10:15:39,882 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0207:localhost/127.0.0.1:37241 -> KillTaskAction:
>> task_200806300936_0279_r_000002_0
>> 2008-07-01 10:15:41,863 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0214:localhost/127.0.0.1:41484 -> KillTaskAction:
>> task_200806300936_0279_r_000004_0
>> 2008-07-01 10:15:51,484 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0207:localhost/127.0.0.1:37241 -> KillTaskAction:
>> task_200806300936_0279_r_000006_0
>> 2008-07-01 10:15:51,939 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0214:localhost/127.0.0.1:41484 -> KillTaskAction:
>> task_200806300936_0279_r_000007_0
>> 2008-07-01 10:15:59,695 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0207:localhost/127.0.0.1:37241 -> KillTaskAction:
>> task_200806300936_0279_r_000008_0
>> 2008-07-01 10:16:45,620 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0202:localhost/127.0.0.1:47183 -> KillTaskAction:
>> task_200806300936_0279_r_000005_0
>> 2008-07-01 10:16:47,328 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0216:localhost/127.0.0.1:37282 -> KillJobAction:
>> job_200806300936_0279
>> 2008-07-01 10:16:47,334 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f020c:localhost/127.0.0.1:52033 -> KillJobAction:
>> job_200806300936_0279
>> 2008-07-01 10:16:47,453 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0210:localhost/127.0.0.1:35235 -> KillJobAction:
>> job_200806300936_0279
>> 2008-07-01 10:16:47,768 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f020d:localhost/127.0.0.1:41562 -> KillJobAction:
>> job_200806300936_0279
>> 2008-07-01 10:16:48,652 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0203:localhost/127.0.0.1:65277 -> KillJobAction:
>> job_200806300936_0279
>> 2008-07-01 10:16:49,005 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0205:localhost/127.0.0.1:48747 -> KillJobAction:
>> job_200806300936_0279
>> 2008-07-01 10:16:49,365 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0209:localhost/127.0.0.1:59538 -> KillJobAction:
>> job_200806300936_0279
>> 2008-07-01 10:16:49,563 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0214:localhost/127.0.0.1:41484 -> KillJobAction:
>> job_200806300936_0279
>> 2008-07-01 10:16:49,747 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f020a:localhost/127.0.0.1:40410 -> KillJobAction:
>> job_200806300936_0279
>> 2008-07-01 10:16:50,321 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0212:localhost/127.0.0.1:33514 -> KillJobAction:
>> job_200806300936_0279
>> 2008-07-01 10:16:50,352 DEBUG org.apache.hadoop.mapred.JobTracker:
>> tracker_m0f0207:localhost/127.0.0.1:37241 -> KillJobAction:
>> job_200806300936_0279
>>
>> Notice that tracker_m0f0202 receives a KillTaskAction (which removes
>> the task working directory), but not a KillJobAction (which would
>> remove the job directory). All the other trackers received
>> KillJobAction. I'm not sure what's happening here to cause this.
>>
>> This is on 0.16.4.
>>
>> Anyone else seen this?
>>
>> Tom
>>   
>


Mime
View raw message