hadoop-common-user mailing list archives

From Frédéric Bertin <frederic.ber...@anyware-tech.com>
Subject Re: MapReduce: specify a *DFS* path for mapred.jar property
Date Fri, 01 Sep 2006 09:18:04 GMT
Owen O'Malley wrote:
>
> On Aug 31, 2006, at 9:48 AM, Frédéric Bertin wrote:
>
>> Doug Cutting wrote:
>>> Frédéric Bertin wrote:
>>>>>             *// Set the user's name and working directory*
>>>>>             String user = System.getProperty("user.name");
>>>>>             job.setUser(user != null ? user : "Dr Who");
>>>>>             if (job.getWorkingDirectory() == null) {
>>>>>               job.setWorkingDirectory(fs.getWorkingDirectory());
>>>>>             }
>>>
>>> This should run clientside, since it depends on the username, which 
>>> is different on the server.
>> then, what about passing the username as a parameter to the 
>> JobSubmissionProtocol.submitJob(...) ? This avoids loading the whole 
>> JobConf clientside just to set the username.
>
> I don't understand what the problem is. The user sets up their job by 
> creating a JobConf(). Do you already have the job.xml in dfs and just 
> want to resubmit it? I don't think that will ever be the typical 
> case.  I thought the original topic of this thread was the jar file.
>
> -- Owen
Yes, you're right, the main topic was indeed the jar file. What I would 
like to do is put my jobs' jars in a dedicated location in HDFS and refer 
to them in the jobs' configurations when I submit jobs remotely.
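
Just to illustrate, something like this on the client side (purely a 
sketch; it assumes mapred.jar can take an "hdfs:" path, and the namenode 
address and paths are made up):

    import org.apache.hadoop.mapred.JobConf;

    JobConf job = new JobConf();
    // point mapred.jar at a jar that already lives in HDFS instead of
    // shipping a local jar with every submission
    job.setJar("hdfs://namenode:9000/jobs/job1/job1.jar");
    job.setJobName("job1");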

In other words, I would like a centralized job repository in HDFS where 
all jobs are stored, something like:

/jobs
    /job1
        job1.xml
        job1.jar
    /job2
        job2.xml
        job2.jar
    ...

Then, submitting a job would be as simple as 
JobSubmissionProtocol.submitJob(new Path("/jobs/job1/job1.xml"), 
username, parametersMap), where parametersMap allows overriding the 
default job1.xml properties or adding new ones.
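
To be more concrete, here is a rough sketch of the protocol method I have 
in mind (purely hypothetical, the signature and names are made up):

    import java.io.IOException;
    import java.util.Map;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobStatus;

    // hypothetical extension of JobSubmissionProtocol: submit a job whose
    // job.xml and jar already live in HDFS, with optional property
    // overrides, so the client never builds a full JobConf locally
    public interface JobSubmissionProtocol {
      JobStatus submitJob(Path jobFile,                 // e.g. /jobs/job1/job1.xml
                          String username,              // passed explicitly by the client
                          Map<String, String> overrides // overrides/additions to job1.xml
                          ) throws IOException;
    }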

This would allow, for example, setting up several remote job schedulers 
that don't need to hold the jobs' jars. Updating a job to a new version 
would then not require updating every job client, only updating the jar 
(and maybe the config file) in HDFS.

However, because the InputFormat and OutputFormat classes are used 
client-side, job clients still need the jobs' jars whenever custom 
implementations of these classes are involved (which is the case in 
almost all my jobs).
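
To make that constraint concrete, it is roughly this kind of client-side 
step which forces the jar onto the client (a simplified sketch, not the 
actual JobClient code; the property name is an assumption):

    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.JobConf;

    // if the client has to instantiate the job's InputFormat (e.g. to
    // validate input or compute splits), the custom class must be loadable
    // client-side, hence the job jar must be present on the client
    static InputFormat loadInputFormat(JobConf job) throws Exception {
      String className = job.get("mapred.input.format.class"); // assumed key
      return (InputFormat) Class.forName(className).newInstance();
    }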

To summarize, my two main concerns are:

    * Save bandwidth between the JobClient and the JobTracker: that's why
      I would like to avoid transferring a multi-megabyte jar file on
      every job submission. An "hdfs:" path in the mapred.jar property
      solves one part of the problem: no transfer from the JobClient to
      HDFS is needed anymore. However, the Input/OutputFormat issue above
      still requires transferring the jar from HDFS to the JobClients.
      That's why I proposed moving this part of the JobClient code to the
      JobTracker. To avoid the synchronization issue Doug pointed out, we
      could for example put this code in a new, non-synchronized "boolean
      validateJob(...)" method in JobSubmissionProtocol. The JobClient
      would then call validateJob and, if no error is returned, call
      submitJob (see the sketch below this list).
    * Ease job version control by centralizing the jobs' location as much
      as possible.
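
Here is the sketch mentioned above of how the validateJob/submitJob split 
could look (again hypothetical, the names are made up):

    import java.io.IOException;

    import org.apache.hadoop.mapred.JobStatus;

    public interface JobSubmissionProtocol {
      // non-synchronized validation step: the JobClient calls this first,
      // so the slow checks (e.g. input/output validation) run without
      // holding the JobTracker lock
      boolean validateJob(String jobFile) throws IOException;

      // actual submission, called only after validateJob succeeded
      JobStatus submitJob(String jobFile) throws IOException;
    }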

thoughts?

Fred



