hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alejandro Abdelnur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3149) supporting multiple outputs for M/R jobs
Date Tue, 01 Apr 2008 18:20:24 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584241#action_12584241
] 

Alejandro Abdelnur commented on HADOOP-3149:
--------------------------------------------

Runping,

The MultipleOutpuFormat of HADOOP-2906 is more general that what HADOOP-3149 tries to solve,
and because of that it requires more coding and handling from the Map/Reduce code: 1. the
developer has to take care of avoiding file collision, the developer has to take care of closing
the MultipleOutputFormat files. it is not clear how you define the classes for the key/value
for SequenceFiles.

The MO* API allows to easily define named-outputs with its own configuration (output-format,
key class, value class) and use them on the usual collector way. The MO code takes care of
namespacing the files names, creating and closing them.
 

> supporting multiple outputs for M/R jobs
> ----------------------------------------
>
>                 Key: HADOOP-3149
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3149
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.17.0
>
>         Attachments: patch3149.txt
>
>
> The outputcollector supports writing data to a single output, the 'part' files in the
output path.
> We found quite common that our M/R jobs have to write data to different output. For example
when classifying data as NEW, UPDATE, DELETE, NO-CHANGE to later do different processing on
it.
> Handling the initialization of additional outputs from within the M/R code complicates
the code and is counter intuitive with the notion of job configuration.
> It would be desirable to:
> # Configure the additional outputs in the jobconf, potentially specifying different outputformats,
key and value classes for each one.
> # Write to the additional outputs in a similar way as data is written to the outputcollector.
> # Support the speculative execution semantics for the output files, only visible in the
final output for promoted tasks.
> To support multiple outputs the following classes would be added to mapred/lib:
> * {{MOJobConf}} : extends {{JobConf}} adding methods to define named outputs (name, outputformat,
key class, value class)
> * {{MOOutputCollector}} : extends {{OutputCollector}} adding a {{collect(String outputName,
WritableComparable key, Writable value)}} method.
> * {{MOMapper}} and {{MOReducer}}: implement {{Mapper}} and {{Reducer}} adding a new {{configure}},
{{map}} and {{reduce}} signature that take the corresponding {{MO}} classes and performs the
proper initialization.
> The data flow behavior would be: key/values written to the default (unnamed) output (using
the original OutputCollector {{collect}} signature) take part of the shuffle/sort/reduce processing
phases. key/values written to a named output from within a map don't.
> The named output files would be named using the task type and task ID to avoid collision
among tasks (i.e. 'new-m-00002' and 'new-r-00001').
> Together with the setInputPathFilter feature introduced by HADOOP-2055 it would become
very easy to chain jobs working on particular named outputs within a single directory.
> We are using heavily this pattern and it greatly simplified our M/R code as well as chaining
different M/R. 
> We wanted to contribute this back to Hadoop as we think is a generic feature many could
benefit from.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message