hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-372) should allow to specify different inputformat classes for different input dirs for Map/Reduce jobs
Date Mon, 11 Jun 2007 17:34:26 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12503493 ]

Doug Cutting commented on HADOOP-372:

bq. The other major change is to InputSplits, which get the ability to define their RecordReader
and Mapper.

Hmm.  This is a big departure from the previously proposed API:
public interface JobInput {
  public List<InputSplit> getSplits(JobConf);
  public RecordReader getRecordReader(JobConf, InputSplit);
  public Mapper getMapper(JobConf, InputSplit, RecordReader);
}

Your new API moves the latter two methods to the InputSplit.  Can you motivate this?

I question whether it's a good idea to move such "policy" methods to an "implementation" class
like InputSplit.  It seems to me that we'll want to use inheritance to implement InputSplits,
and inheritance can fight with implementation of interfaces.  A typical application will want
to be able to define its Mapper and its splitter/RecordReader orthogonally, and we want to
make it as simple as possible to paste such independent implementations together.  Splitter
and RecordReader implementations will often go together, so it makes sense to have them share
implementations.  But Mappers are frequently independent.  How, using the above, would one
define a single mapper that operates over inputs in different formats but produces compatible
keys and values (i.e., merging or joining)?  One should be able to do that by specifying some
sort of compound input format and only a single mapper implementation.  One should be able
to extend mappers and splitters independently and then glue them together at the last minute.
Attaching mappers to the split instance seems like it could complicate that.
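To make the concern concrete, here is a minimal, self-contained sketch (all class names are invented for illustration; this is not Hadoop's actual API) of two format-specific record readers emitting compatible (url, value) records, with a single Mapper implementation glued to both at the last minute, independent of either format:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Hadoop's real API: two format-specific readers
// emit compatible (url, value) records; one Mapper implementation consumes
// both without knowing which format produced each record.
public class CompoundInputSketch {

    interface RecordReader { List<String[]> records(); }
    interface Mapper { void map(String key, String value, List<String> out); }

    // Stands in for a SequenceFile-backed reader of a rank table.
    static class RankReader implements RecordReader {
        public List<String[]> records() {
            return Arrays.asList(new String[][] {{"http://a", "0.9"}});
        }
    }

    // Stands in for a text-backed reader of a classification table.
    static class ClassificationReader implements RecordReader {
        public List<String[]> records() {
            return Arrays.asList(new String[][] {{"http://a", "news"}});
        }
    }

    // One mapper; it never needs to know which format produced the record.
    static class JoinMapper implements Mapper {
        public void map(String key, String value, List<String> out) {
            out.add(key + "\t" + value);
        }
    }

    static List<String> run() {
        List<String> out = new ArrayList<String>();
        Mapper mapper = new JoinMapper();
        for (RecordReader reader : Arrays.<RecordReader>asList(
                new RankReader(), new ClassificationReader())) {
            for (String[] rec : reader.records()) {
                mapper.map(rec[0], rec[1], out);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

If the Mapper were attached to each split instance instead, this last-minute pairing of one mapper with several independently developed readers would be harder to express.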

Before we agree on these APIs, I'd like to see some APIs for both reusable splitters and record
readers as well as sample application code that uses mappers.  Perhaps we should start a wiki
page with code examples given various APIs?

> should allow to specify different inputformat classes for different input dirs for Map/Reduce
> --------------------------------------------------------------------------------------------------
>                 Key: HADOOP-372
>                 URL: https://issues.apache.org/jira/browse/HADOOP-372
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.4.0
>         Environment: all
>            Reporter: Runping Qi
>            Assignee: Owen O'Malley
> Right now, the user can specify multiple input directories for a map reduce job. 
> However, the files under all the directories are assumed to be in the same format, 
> with the same key/value classes. This proves to be a serious limitation in many situations.

> Here is an example. Suppose I have three simple tables: 
> one has URLs and their rank values (page ranks), 
> another has URLs and their classification values, 
> and the third one has the URL meta data such as crawl status, last crawl time, etc. 
> Suppose now I need a job to generate a list of URLs to be crawled next. 
> The decision depends on the info in all the three tables.
> Right now, there is no easy way to accomplish this.
> However, this job can be done if the framework allows specifying different input formats
> for different input dirs.
> Suppose my three tables are in the following directories respectively: rankTable, classificationTable,
> and metaDataTable. 
> If we extend JobConf class with the following method (as Owen suggested to me):
>     addInputPath(aPath, anInputFormatClass, anInputKeyClass, anInputValueClass)
> Then I can specify my job as follows:
>     addInputPath(rankTable, SequenceFileInputFormat.class, UTF8.class, DoubleWritable.class)
>     addInputPath(classificationTable, TextInputFormat.class, UTF8.class, UTF8.class)
>     addInputPath(metaDataTable, SequenceFileInputFormat.class, UTF8.class, MyRecord.class)
> If an input directory is added through the current API, it will have the same meaning
> as it does now. 
> Thus this extension will not affect any applications that do not need this new feature.
> It is relatively easy for the M/R framework to create an appropriate record reader for
> a map task based on the above information.
> And that is the only change needed for supporting this extension.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
