hadoop-common-dev mailing list archives

From Arkady Borkovsky <ark...@yahoo-inc.com>
Subject Re: [jira] Commented: (HADOOP-372) should allow to specify different inputformat classes for different input dirs for Map/Reduce jobs
Date Thu, 24 Aug 2006 16:41:44 GMT
+1 for Owen's arguments.

Although at the code level, anything that is done to the input record  
can be put either into an InputFormat or into a Mapper, it seems to be  
quite important to force a clear separation of these concepts.
A more naive user may prefer to be completely ignorant about the notion  
of InputFormat.

Defining interfaces that are easy to understand is more than just  
syntactic sugar, and usability should not be sacrificed to  

On Aug 23, 2006, at 11:49 PM, Doug Cutting (JIRA) wrote:

>     [ http://issues.apache.org/jira/browse/HADOOP-372?page=comments#action_12430107 ]
> Doug Cutting commented on HADOOP-372:
> -------------------------------------
>> A very typical case is to have the same input format, but different  
>> Mappers
> But, if the mapper is a function of the input format this can instead  
> be:
> job.addInputPath("foo", FooInput.class);
> job.addInputPath("bar", BarInput.class);
> Where FooInput is defined with something like:
> public class FooInput extends TextInput {
>   public void map(...) { ... };
> }
> In other words, if you're going to define custom mappers anyway, then  
> it's no more work to define custom Input formats.
>> should allow to specify different inputformat classes for different  
>> input dirs for Map/Reduce jobs
>> --------------------------------------------------------------------------------------------------
>>                 Key: HADOOP-372
>>                 URL: http://issues.apache.org/jira/browse/HADOOP-372
>>             Project: Hadoop
>>          Issue Type: New Feature
>>          Components: mapred
>>    Affects Versions: 0.4.0
>>         Environment: all
>>            Reporter: Runping Qi
>>         Assigned To: Owen O'Malley
>> Right now, the user can specify multiple input directories for a map  
>> reduce job.
>> However, the files under all the directories are assumed to be in the  
>> same format,
>> with the same key/value classes. This proves to be a serious limitation  
>> in many situations.
>> Here is an example. Suppose I have three simple tables:
>> one has URLs and their rank values (page ranks),
>> another has URLs and their classification values,
>> and the third one has the URL meta data such as crawl status, last  
>> crawl time, etc.
>> Suppose now I need a job to generate a list of URLs to be crawled  
>> next.
>> The decision depends on the info in all the three tables.
>> Right now, there is no easy way to accomplish this.
>> However, this job can be done if the framework allows specifying  
>> different inputformats for different input dirs.
>> Suppose my three tables are in the following directories, respectively:  
>> rankTable, classificationTable, and metaDataTable.
>> If we extend JobConf class with the following method (as Owen  
>> suggested to me):
>>     addInputPath(aPath, anInputFormatClass, anInputKeyClass,  
>> anInputValueClass)
>> Then I can specify my job as follows:
>>     addInputPath(rankTable, SequenceFileInputFormat.class,  
>> UTF8.class, DoubleWritable.class)
>>     addInputPath(classificationTable, TextInputFormat.class,  
>> UTF8.class, UTF8.class)
>>     addInputPath(metaDataTable, SequenceFileInputFormat.class,  
>> UTF8.class, MyRecord.class)
>> If an input directory is added through the current API, it will have  
>> the same meaning as it does now.
>> Thus this extension will not affect any applications that do not need  
>> this new feature.
>> It is relatively easy for the M/R framework to create an appropriate  
>> record reader for a map task based on the above information.
>> And that is the only change needed for supporting this extension.
> -- 
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the  
> administrators:  
> http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see:  
> http://www.atlassian.com/software/jira
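
For readers following the thread, the per-directory addInputPath extension described in the quoted issue could be sketched roughly as below. This is only an illustration under stated assumptions, not the Hadoop implementation: the class InputPathRegistry, the nested InputSpec, and the method specFor are hypothetical names, and input formats and key/value classes are represented as plain strings so the example compiles without Hadoop on the classpath. It shows the one piece of bookkeeping the proposal relies on: given a file handed to a map task, find the directory it was registered under and return the format and key/value classes attached to that directory, falling back to the job-wide settings for directories added through the existing single-argument call.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; this class is NOT part of Hadoop's JobConf.
final class InputPathRegistry {

    /** The information the proposal attaches to each input directory. */
    static final class InputSpec {
        final String inputFormat; // e.g. "SequenceFileInputFormat" (a name only, for illustration)
        final String keyClass;
        final String valueClass;

        InputSpec(String inputFormat, String keyClass, String valueClass) {
            this.inputFormat = inputFormat;
            this.keyClass = keyClass;
            this.valueClass = valueClass;
        }
    }

    private final InputSpec defaultSpec;
    private final Map<String, InputSpec> specs = new LinkedHashMap<>();

    /** defaultSpec plays the role of the job-wide input format and key/value classes. */
    InputPathRegistry(InputSpec defaultSpec) {
        this.defaultSpec = defaultSpec;
    }

    /** Existing-style call: the directory keeps the job-wide settings. */
    void addInputPath(String dir) {
        specs.put(normalize(dir), defaultSpec);
    }

    /** Proposed-style call: the directory carries its own format and classes. */
    void addInputPath(String dir, String inputFormat, String keyClass, String valueClass) {
        specs.put(normalize(dir), new InputSpec(inputFormat, keyClass, valueClass));
    }

    /** Returns the spec of the longest registered directory that contains the file. */
    InputSpec specFor(String file) {
        String path = normalize(file);
        InputSpec best = null;
        int bestLength = -1;
        for (Map.Entry<String, InputSpec> entry : specs.entrySet()) {
            String dir = entry.getKey();
            boolean inside = path.equals(dir) || path.startsWith(dir + "/");
            if (inside && dir.length() > bestLength) {
                best = entry.getValue();
                bestLength = dir.length();
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("No input path registered for " + file);
        }
        return best;
    }

    /** Strips trailing slashes so "rankTable/" and "rankTable" are treated alike. */
    private static String normalize(String path) {
        String p = path;
        while (p.length() > 1 && p.endsWith("/")) {
            p = p.substring(0, p.length() - 1);
        }
        return p;
    }
}
```

With the three tables from the issue registered this way, a file such as rankTable/part-00000 would resolve to the spec given for rankTable, while a directory added with the one-argument call would resolve to the job-wide default, which matches the backward-compatibility claim in the issue text.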
