From: Harsh J
Date: Sat, 6 Oct 2012 21:41:18 +0530
Subject: Re: Job jar not removed from staging directory on job failure/how to share a job jar using distributed cache
To: user@hadoop.apache.org

Bertrand,

Yes, this is an unfortunate edge case. However, it is fixed in the trunk/2.x
client rewrite and is now tracked as a test via
https://issues.apache.org/jira/browse/MAPREDUCE-2384.

On Fri, Oct 5, 2012 at 10:28 PM, Bertrand Dechoux wrote:
> Hi,
>
> I am launching my job using the command line, and I observed that when the
> provided input path does not match any files, the jar in the staging
> directory is not removed.
> It is removed on job termination (success or failure), but here the job
> isn't even really started, so it may be an edge case.
> Has anyone seen the same behaviour? (I am using 1.0.3.)
>
> Here is an extract of the stack trace, with the Hadoop-related classes.
>
>> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: [removed]
>>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
>>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>>     at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:902)
>>     at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:919)
>>     at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
>>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:838)
>>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:791)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:791)
>>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:494)
>
> My second question is somewhat related, because one of its consequences
> would nullify the impact of the above 'bug'.
> Is it possible to set the main job jar directly to a jar already inside
> HDFS?
> From what I know, the configuration points to a local jar archive, which is
> uploaded each time to the staging directory.
>
> The same question was asked in the JIRA, but without a clear resolution:
> https://issues.apache.org/jira/browse/MAPREDUCE-236
>
> My question might be related to
> https://issues.apache.org/jira/browse/MAPREDUCE-4408,
> which is resolved for the next version. But it seems to be only about the
> uber jar, and I am using a standard jar.
> If it works with an HDFS location, what are the details? Won't the jar be
> cleaned up on job termination? Why not? Will it also be set up within the
> distributed cache?
>
> Regards
>
> Bertrand
>
> PS: I know there are other solutions to my problem.
> I will look at Oozie.
> And in the worst case, I can create a FileSystem instance myself to check
> whether the job should really be launched or not. Both could work, but both
> seem overkill in my context.

--
Harsh J
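The FileSystem-based pre-check mentioned in the PS above could be sketched roughly as follows. This is a minimal sketch against the Hadoop 1.x API, not the list's recommended fix; the class name is hypothetical, and the input path is taken from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical driver wrapper: verify the input path exists before
// submitting, so the job jar is never copied to the staging directory
// for a submission that is doomed to fail in getSplits().
public class InputPathPreCheck {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml etc. from the classpath
        FileSystem fs = FileSystem.get(conf);

        Path input = new Path(args[0]); // the job's input directory

        if (!fs.exists(input)) {
            System.err.println("Input path does not exist, not submitting: " + input);
            System.exit(1);
        }

        // ...otherwise configure the Job and call job.waitForCompletion(true) as usual.
    }
}
```

Note that this only avoids the stale jar for the missing-input case discussed in this thread; any other submission-time failure after the jar upload would still leave it behind on 1.0.3.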