hadoop-common-user mailing list archives

From Frédéric Bertin <frederic.ber...@anyware-tech.com>
Subject Re: MapReduce: specify a *DFS* path for mapred.jar property
Date Fri, 01 Sep 2006 21:33:37 GMT
Doug Cutting wrote:
> Frédéric Bertin wrote:
>> Indeed, I would like to have a centralized jobs repository on the 
>> HDFS where all jobs will be stored. Something like
>> /jobs
>>    /job1
>>       job1.xml
>>       job1.jar
>>    /job2
>>       job2.xml
>>       job2.jar
>>    ...
>> Then, submitting a job would be as simple as 
>> JobSubmissionProtocol.submitJob(new Path("/jobs/job1/job1.xml"), 
>> username, parametersMap), where parametersMap can override 
>> default job1.xml properties or add new ones.
> So the job1.xml isn't really a job, but rather some configuration 
> defaults. Would this need be met if config defaults were nameable by 
> hdfs uris?
> This way, you could create a new JobConf, set its jar file to be an 
> hdfs uri, then add an hdfs uri as a default configuration resource, 
> then submit the job.  How does that sound?
That does sound like a very good idea! It also cleanly separates 
"internal" job properties (mapper/reducer class, ...) from 
application-level ones.
>> To summarize, my two main concerns are:
>>    * save bandwidth between JobClient and JobTracker: that's why I
>>      would like to avoid the transfer of a "x MB" jar file at each job
>>      submission.
> Have you actually seen this to be a significant portion of job 
> execution time?
No, I can't say for now, since we are still using a local test 
environment. Anyway, this is less a matter of job execution time than a 
matter of anticipating bandwidth usage and trying to save it as early as 
possible.

In the future, we will have to submit jobs over the Internet to 
a remote HDFS cluster. Some of these jobs will be scheduled at a fixed 
rate (for example, every hour). That's why transferring the same jar 
every hour (multiplied by the number of jobs to be run) initially seemed 
inefficient (and unnecessary) to me. Even if we have a large 
bandwidth, it will be shared among many other applications, hence my 
concern.
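
The bandwidth argument is simple arithmetic, which a small helper makes explicit. The class and method names are hypothetical, and any figures passed in are assumptions rather than measurements from this thread.

```java
// Hypothetical helper; the inputs are illustrative assumptions,
// not numbers reported anywhere in this discussion.
final class JarTransferEstimate {
    private JarTransferEstimate() {}

    /** Megabytes uploaded per day if every run of every scheduled job re-sends its jar. */
    static long dailyUploadMb(long jarSizeMb, int scheduledJobs, int runsPerJobPerDay) {
        return jarSizeMb * scheduledJobs * runsPerJobPerDay;
    }
}
```

For instance, assuming five hourly jobs with a 10 MB jar each, dailyUploadMb(10, 5, 24) comes to 1200 MB per day, whereas a jar already stored on the DFS would only need to be uploaded once per version.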

>>    * ease job versioning control by trying to centralize jobs location
>>      as much as possible
> Wouldn't a version control system like subversion be a better way to 
> meet this goal?  We're talking about versioning your software, not 
> data, right?
Yes, it's about easing the deployment of new job (= jar) versions.


