hadoop-common-dev mailing list archives

From "Vuk Ercegovac (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-3926) Multiple, generic InputFormats for MapReduce
Date Fri, 08 Aug 2008 07:00:44 GMT
Multiple, generic InputFormats for MapReduce
--------------------------------------------

                 Key: HADOOP-3926
                 URL: https://issues.apache.org/jira/browse/HADOOP-3926
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
            Reporter: Vuk Ercegovac
            Priority: Minor


The feature that allows an InputFormat to be specified per path for a MapReduce job
(see http://issues.apache.org/jira/browse/HADOOP-372) should be generalized to support
InputFormats other than FileInputFormat (e.g., an HBase table). This is needed when joining
or co-grouping multiple inputs. Even for the case of multiple FileInputFormats, it seems that
if a sub-class sets and configures itself from the JobConf, the inputs will need to ensure
that they do not have name clashes. In general, the child InputFormats should not be aware
of each other.
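
To illustrate the name-clash concern, here is a minimal sketch that uses a plain map as a
stand-in for the JobConf. The class name, the property name, and the "union.child." prefix
are all hypothetical and exist only for this illustration; they are not Hadoop or Jaql names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for a JobConf: a flat map of string properties. Not the Hadoop class.
final class ConfClashSketch {

    // Hypothetical property name that every child input wants to set.
    static final String INPUT_KEY = "example.input.location";

    // Shared approach: every child writes the same key into one configuration,
    // so each child overwrites the value set by the child before it.
    static Map<String, String> sharedConf(String... locations) {
        Map<String, String> conf = new LinkedHashMap<>();
        for (String location : locations) {
            conf.put(INPUT_KEY, location);
        }
        return conf;
    }

    // Isolated approach: each child's property is stored under its own index,
    // so children never see or overwrite each other's settings.
    static Map<String, String> isolatedConf(String... locations) {
        Map<String, String> conf = new LinkedHashMap<>();
        for (int i = 0; i < locations.length; i++) {
            conf.put("union.child." + i + "." + INPUT_KEY, locations[i]);
        }
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(sharedConf("/logs/a", "table:users"));
        System.out.println(isolatedConf("/logs/a", "table:users"));
    }
}
```

With the shared map only the last child's location survives, whereas the indexed keys keep
both, which is the property a generalized multi-input feature would need.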

We've implemented this for Jaql but would like to remove dependencies on other libs (json)
and see how it can be integrated with the HADOOP-372 changes. It works similarly to HADOOP-372.
A UnionInputFormat consists of multiple child InputFormats. The UnionInputFormat records an
array of <InputFormat, name-value pairs for JobConf> in the JobConf. For creating splits,
it collects child splits (similar to DelegatingInputFormat) and wraps each child's split with
its index into the array (similar to TaggedInputSplit). The UnionInputFormat, given a split,
can then dig out the corresponding InputFormat given its index, instantiate it, and return
its RecordReader. Each child InputFormat depends on an empty JobConf being set up prior to
its instantiation. An alternative is to use a string version of an InputFormat's setup JobConf.
The analog to DelegatingMapper simply exposes the child split's index to drive per-input logic
(in our case, it is a script rather than a Map class). As with HADOOP-372, these are lib-level
changes, not core.
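
The split-wrapping scheme described above can be sketched roughly as follows. This is a
simplified illustration and not the Jaql implementation: ChildFormat, IndexedSplit, ToyFormat
and UnionFormatSketch are hypothetical stand-ins, splits and records are plain strings, a map
replaces the JobConf, and child instances are held in memory rather than being recorded in
the JobConf and re-instantiated on the task side.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Simplified stand-in for an InputFormat; not the Hadoop interface.
interface ChildFormat {
    List<String> getSplits(Map<String, String> conf);
    Iterator<String> getRecordReader(String split, Map<String, String> conf);
}

// A child's split wrapped with the index of the child that produced it.
final class IndexedSplit {
    final int childIndex;
    final String childSplit;

    IndexedSplit(int childIndex, String childSplit) {
        this.childIndex = childIndex;
        this.childSplit = childSplit;
    }
}

// Toy child that derives two splits from its own "location" property.
final class ToyFormat implements ChildFormat {
    public List<String> getSplits(Map<String, String> conf) {
        String location = conf.get("location");
        return List.of(location + "#0", location + "#1");
    }

    public Iterator<String> getRecordReader(String split, Map<String, String> conf) {
        return List.of(split + ":row").iterator();
    }
}

final class UnionFormatSketch {
    // Parallel lists play the role of the <InputFormat, name-value pairs> array.
    private final List<ChildFormat> formats = new ArrayList<>();
    private final List<Map<String, String>> props = new ArrayList<>();

    void addChild(ChildFormat format, Map<String, String> childProps) {
        formats.add(format);
        props.add(new HashMap<>(childProps));
    }

    // Collect every child's splits, tagging each with the child's index.
    List<IndexedSplit> getSplits() {
        List<IndexedSplit> result = new ArrayList<>();
        for (int i = 0; i < formats.size(); i++) {
            // Each child only ever sees a fresh copy of its own properties.
            Map<String, String> conf = new HashMap<>(props.get(i));
            for (String split : formats.get(i).getSplits(conf)) {
                result.add(new IndexedSplit(i, split));
            }
        }
        return result;
    }

    // Use the index stored in the split to find the child and delegate to it.
    Iterator<String> getRecordReader(IndexedSplit split) {
        int i = split.childIndex;
        return formats.get(i).getRecordReader(split.childSplit, new HashMap<>(props.get(i)));
    }

    // Two children that both use the key "location" without clashing.
    static UnionFormatSketch demo() {
        UnionFormatSketch union = new UnionFormatSketch();
        union.addChild(new ToyFormat(), Map.of("location", "/logs/a"));
        union.addChild(new ToyFormat(), Map.of("location", "table:users"));
        return union;
    }

    public static void main(String[] args) {
        UnionFormatSketch union = demo();
        for (IndexedSplit split : union.getSplits()) {
            System.out.println(split.childIndex + " -> " + union.getRecordReader(split).next());
        }
    }
}
```

Because each split carries its child index, the same index could also be handed to the map
side to select per-input logic, as the DelegatingMapper analog in the description does.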

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

