hadoop-mapreduce-dev mailing list archives

From Chris K Wensel <ch...@wensel.net>
Subject Job vs. Configuration
Date Tue, 11 Aug 2009 04:02:09 GMT

Hey all

Looking at (converting to) the new .20 API, I see that the static  
config setters take Job or JobContext, not Configuration.
 >> public static Path[] getInputPaths(JobContext context)

I get the utility of this from the perspective of a user writing
Hadoop jobs: far fewer job.getConfiguration() calls.

But I do find it odd that FileInputFormat, for example, knows about Job
and JobContext (and children) when it feels as if it should only know
about Configuration (considering that's all they do is get/set

From my perspective, Cascading in part isn't much more than a fancy
Configuration builder. And the internals all really only care about  
Configuration as they may be asked to provide a property outside the  
context of a job.

So being a builder, a Configuration object is passed around throughout  
the system at different stages (planning, execution, etc) in order to  
accumulate properties from nested components.

With the new API, it all adds up to the need to wrap Configuration in  
a Job/JobContext and then unwrap it so the Configuration instance can  
move down the configuration chain.

But this isn't easy to do, as new Job( configuration ) sets
the configuration as a default property collection, and any set() on
Job won't influence the defaults. The result is a lot of Configuration
algebra to merge the final results (or a bit of reflection).
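
A rough sketch of the wrap/unwrap problem described above, using
java.util.Properties as a stand-in because it has the same "defaults"
behaviour. This is not Hadoop code: wrap() and mergeBack() are hypothetical
helpers, the property keys are made up, and the real Job/Configuration
classes may differ in detail.

```java
import java.util.Properties;

// Stand-in sketch: java.util.Properties plays the role of Configuration.
// It only mimics the "defaults" behaviour described in the message.
class DefaultsSketch {

    // Hypothetical analogue of "new Job( configuration )": the passed-in
    // properties become the defaults of a fresh collection.
    static Properties wrap(Properties configuration) {
        return new Properties(configuration);
    }

    // The "Configuration algebra": copy everything visible through the
    // wrapper back into the original so it can move down the chain.
    static void mergeBack(Properties wrapper, Properties configuration) {
        for (String key : wrapper.stringPropertyNames()) {
            configuration.setProperty(key, wrapper.getProperty(key));
        }
    }

    public static void main(String[] args) {
        Properties configuration = new Properties();
        configuration.setProperty("planner.stage", "planning"); // made-up key

        Properties job = wrap(configuration);
        job.setProperty("input.dir", "/data/in"); // made-up key

        // Reads fall through to the defaults ...
        System.out.println(job.getProperty("planner.stage"));
        // ... but the set() never reached the original (prints null).
        System.out.println(configuration.getProperty("input.dir"));

        // Only an explicit merge makes the new property visible upstream.
        mergeBack(job, configuration);
        System.out.println(configuration.getProperty("input.dir"));
    }
}
```

Every component that has to call a Job-based setter would need a merge step
like this before handing the Configuration on.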

Would it make sense to accept Configuration instead of the JobContext
and its sub-classes?
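
One possible shape for this, sketched with hypothetical stand-in classes
(Conf, JobCtx and InputPathsSketch are invented for illustration and are not
the Hadoop classes; the property key is made up as well). The core setter
takes only the configuration, and a convenience overload for job authors
delegates to it. Note that this JobCtx shares the Conf by reference, which is
an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a configuration object.
class Conf {
    private final Map<String, String> props = new LinkedHashMap<>();
    String get(String key) { return props.get(key); }
    void set(String key, String value) { props.put(key, value); }
}

// Hypothetical stand-in for a job context that exposes its configuration.
class JobCtx {
    private final Conf conf;
    JobCtx(Conf conf) { this.conf = conf; }
    Conf getConfiguration() { return conf; }
}

class InputPathsSketch {
    static final String INPUT_DIR = "sketch.input.dir"; // made-up key

    // Core setter: knows only about the configuration object.
    static void addInputPath(Conf conf, String path) {
        String existing = conf.get(INPUT_DIR);
        conf.set(INPUT_DIR, existing == null ? path : existing + "," + path);
    }

    // Convenience overload for users writing jobs; it simply delegates.
    static void addInputPath(JobCtx context, String path) {
        addInputPath(context.getConfiguration(), path);
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        addInputPath(conf, "/data/a");             // builder-style caller
        addInputPath(new JobCtx(conf), "/data/b"); // job-style caller
        System.out.println(conf.get(INPUT_DIR));   // prints /data/a,/data/b
    }
}
```

With this layout, subsystems that only manipulate properties never need to
see the job types, while job authors keep the shorter call.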

You could argue I should just use JobContext in my APIs. But again,
many of my subsystems shouldn't really know of JobContext; they only
care about manipulating the Configuration object. Further, the use of
Job, JobContext, TaskAttemptContext, etc. in the static setters is
 >>  public static void addInputPath(Job job, Path path) throws  
IOException {

I wonder if Hive and Pig (will) have similar issues.


Chris K Wensel
