hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-372) should allow to specify different inputformat classes for different input dirs for Map/Reduce jobs
Date Thu, 20 Jul 2006 10:14:15 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-372?page=comments#action_12422385 ] 
            
Doug Cutting commented on HADOOP-372:
-------------------------------------

Can you provide more details?  Is the intent for the mapred.input.format.class property to
become multivalued, a parallel list to mapred.input.dir, and when the latter is longer than
the former, the first (or last?) input format is used for unmatched entries?  I can imagine
how MapTask might create its keys, values and a RecordReader, but how would getSplits() and
checkInputDirectories() work?

Another approach to implementing this is to write an InputFormat that wraps keys and/or values
from files of different types in ObjectWritable.  Then map() methods unwrap, introspect and
cast.  With your approach, map methods would still need to introspect and cast; this approach
just adds the wrapper.
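
A minimal sketch of the wrapper approach, for illustration only.  It does not use the
Hadoop classes: Wrapped is a hypothetical stand-in for ObjectWritable, CrawlMeta is a
made-up record type, and plain Java types stand in for the key and value classes.  The
point is only the shape of the map() side: unwrap, introspect, cast.

```java
// Hypothetical stand-in for ObjectWritable: carries a value together with its class.
final class Wrapped {
    private final Class<?> declaredClass;
    private final Object instance;

    Wrapped(Object instance) {
        this.declaredClass = instance.getClass();
        this.instance = instance;
    }

    Class<?> getDeclaredClass() { return declaredClass; }
    Object get() { return instance; }
}

// Hypothetical user-defined value type (assumption, not from the issue).
final class CrawlMeta {
    final String status;
    CrawlMeta(String status) { this.status = status; }
}

final class WrapperSketch {
    // A map()-style method: unwrap the value, inspect its runtime type, then cast.
    static String map(String url, Wrapped value) {
        Object v = value.get();
        if (v instanceof Double) {
            return url + " rank=" + (Double) v;
        } else if (v instanceof String) {
            return url + " class=" + (String) v;
        } else if (v instanceof CrawlMeta) {
            return url + " status=" + ((CrawlMeta) v).status;
        }
        throw new IllegalArgumentException(
            "unexpected value type: " + value.getDeclaredClass().getName());
    }

    public static void main(String[] args) {
        // Records as they might arrive from three differently typed inputs.
        System.out.println(map("http://a.example/", new Wrapped(0.75)));
        System.out.println(map("http://a.example/", new Wrapped("news")));
        System.out.println(map("http://a.example/", new Wrapped(new CrawlMeta("fetched"))));
    }
}
```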

To eliminate the wrapper we'd need to move the getInputKeyClass() and getInputValueClass()
methods to RecordReader.  These are only called in MapRunner.java, when a RecordReader is
already available, so this would be an easy change, and the default implementation could be
back-compatible, accessing the job.
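
Roughly what that change might look like, as a sketch under stated assumptions.
JobSettings and TypedReader are hypothetical stand-ins for the job configuration and
RecordReader; the back-compatible default is written with Java default methods, which
is a choice of this sketch (an abstract base class would serve the same purpose).

```java
// Hypothetical stand-in for the job configuration holding the declared input types.
final class JobSettings {
    private final Class<?> inputKeyClass;
    private final Class<?> inputValueClass;

    JobSettings(Class<?> inputKeyClass, Class<?> inputValueClass) {
        this.inputKeyClass = inputKeyClass;
        this.inputValueClass = inputValueClass;
    }

    Class<?> getInputKeyClass() { return inputKeyClass; }
    Class<?> getInputValueClass() { return inputValueClass; }
}

// Hypothetical reader interface: the type accessors live on the reader,
// and the defaults simply ask the job, so existing readers keep working.
interface TypedReader {
    JobSettings getJob();

    default Class<?> getInputKeyClass() { return getJob().getInputKeyClass(); }
    default Class<?> getInputValueClass() { return getJob().getInputValueClass(); }
}

// An existing reader that overrides nothing: types still come from the job.
final class LegacyReader implements TypedReader {
    private final JobSettings job;
    LegacyReader(JobSettings job) { this.job = job; }
    @Override public JobSettings getJob() { return job; }
}

// A reader that reports the value type recorded in the file it is reading.
final class FileDrivenReader implements TypedReader {
    private final JobSettings job;
    private final Class<?> valueClassFromFile;

    FileDrivenReader(JobSettings job, Class<?> valueClassFromFile) {
        this.job = job;
        this.valueClassFromFile = valueClassFromFile;
    }

    @Override public JobSettings getJob() { return job; }
    @Override public Class<?> getInputValueClass() { return valueClassFromFile; }
}

final class ReaderTypesSketch {
    // What the map runner would do: ask the reader for the types, not the job.
    static String describe(TypedReader reader) {
        return reader.getInputKeyClass().getSimpleName()
            + "/" + reader.getInputValueClass().getSimpleName();
    }

    public static void main(String[] args) {
        JobSettings job = new JobSettings(String.class, Long.class);
        System.out.println(describe(new LegacyReader(job)));
        System.out.println(describe(new FileDrivenReader(job, Double.class)));
    }
}
```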

That's a simpler approach, no?  Just add files with different key and value types, and let
the types in the files drive things rather than having to declare them up front.

> should allow to specify different inputformat classes for different input dirs for Map/Reduce jobs
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-372
>                 URL: http://issues.apache.org/jira/browse/HADOOP-372
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.4.0
>         Environment: all
>            Reporter: Runping Qi
>
> Right now, the user can specify multiple input directories for a map reduce job. 
> However, the files under all the directories are assumed to be in the same format, 
> with the same key/value classes. This proves to be a serious limitation in many situations.

> Here is an example. Suppose I have three simple tables: 
> one has URLs and their rank values (page ranks), 
> another has URLs and their classification values, 
> and the third one has the URL meta data such as crawl status, last crawl time, etc. 
> Suppose now I need a job to generate a list of URLs to be crawled next. 
> The decision depends on the info in all the three tables.
> Right now, there is no easy way to accomplish this.
> However, this job can be done if the framework allows specifying different inputformats for different input dirs.
> Suppose my three tables are in the following directories, respectively: rankTable, classificationTable, and metaDataTable.
> If we extend JobConf class with the following method (as Owen suggested to me):
>     addInputPath(aPath, anInputFormatClass, anInputKeyClass, anInputValueClass)
> Then I can specify my job as follows:
>     addInputPath(rankTable, SequenceFileInputFormat.class, UTF8.class, DoubleWritable.class)
>     addInputPath(classificationTable, TextInputFormat.class, UTF8.class, UTF8.class)
>     addInputPath(metaDataTable, SequenceFileInputFormat.class, UTF8.class, MyRecord.class)
> If an input directory is added through the current API, it will have the same meaning as it does now.
> Thus this extension will not affect any applications that do not need this new feature.
> It is relatively easy for the M/R framework to create an appropriate record reader for a map task based on the above information.
> And that is the only change needed for supporting this extension.
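
The addInputPath() proposal quoted above could be sketched roughly as follows.  This is
an illustration only, not JobConf code: MultiInputConf and InputSpec are hypothetical
names, the input format is represented by its name as a string, and plain Java types
stand in for the key and value classes.  Paths added through the one-argument form get
the job-wide defaults, which mirrors the stated back-compatibility goal.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical record of what to use for one input directory.
final class InputSpec {
    final String formatName;
    final Class<?> keyClass;
    final Class<?> valueClass;

    InputSpec(String formatName, Class<?> keyClass, Class<?> valueClass) {
        this.formatName = formatName;
        this.keyClass = keyClass;
        this.valueClass = valueClass;
    }
}

// Hypothetical stand-in for the extended job configuration.
final class MultiInputConf {
    private final InputSpec defaults;
    private final Map<String, InputSpec> perPath = new LinkedHashMap<>();

    MultiInputConf(InputSpec defaults) { this.defaults = defaults; }

    // Current-style call: the directory uses the job-wide format and classes.
    void addInputPath(String dir) { perPath.put(dir, defaults); }

    // Proposed call: the directory carries its own format and classes.
    void addInputPath(String dir, String formatName, Class<?> keyClass, Class<?> valueClass) {
        perPath.put(dir, new InputSpec(formatName, keyClass, valueClass));
    }

    // What a map task would consult before creating its record reader.
    // The first registered directory that contains the file wins.
    InputSpec specFor(String file) {
        for (Map.Entry<String, InputSpec> e : perPath.entrySet()) {
            String dir = e.getKey();
            if (file.equals(dir) || file.startsWith(dir + "/")) {
                return e.getValue();
            }
        }
        throw new IllegalArgumentException("no input directory registered for " + file);
    }

    public static void main(String[] args) {
        MultiInputConf conf =
            new MultiInputConf(new InputSpec("TextInputFormat", String.class, String.class));
        conf.addInputPath("rankTable", "SequenceFileInputFormat", String.class, Double.class);
        conf.addInputPath("classificationTable");
        InputSpec spec = conf.specFor("rankTable/part-00000");
        System.out.println(spec.formatName + " " + spec.valueClass.getSimpleName());
    }
}
```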

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira