Date: Thu, 24 Apr 2008 19:59:56 -0700 (PDT)
From: "Alejandro Abdelnur (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Updated: (HADOOP-3149) supporting multiple outputs for M/R jobs
Message-ID: <599085447.1209092396157.JavaMail.jira@brutus>
In-Reply-To: <1166162754.1207040907546.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/HADOOP-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3149:
---------------------------------------

    Attachment: patch3149.txt

Fixing
javadocs.

Regarding the removal of Writable and WritableComparable, replacing them with Object: if I understand things correctly, all APIs used directly from mapper/reducer code should be typed or generified to enforce as much type safety as possible at compile time. Internal APIs are more lax. If my assumption is correct, then this API should use Writable/WritableComparable and generics.

> supporting multiple outputs for M/R jobs
> ----------------------------------------
>
>                 Key: HADOOP-3149
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3149
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.18.0
>
>         Attachments: patch3149.txt, patch3149.txt, patch3149.txt, patch3149.txt, patch3149.txt, patch3149.txt, patch3149.txt, patch3149.txt
>
>
> The OutputCollector supports writing data to a single output, the 'part' files in the output path.
> We found it quite common that our M/R jobs have to write data to different outputs. For example, when classifying data as NEW, UPDATE, DELETE, NO-CHANGE to later do different processing on it.
> Handling the initialization of additional outputs from within the M/R code complicates the code and is counterintuitive to the notion of job configuration.
> It would be desirable to:
> # Configure the additional outputs in the jobconf, potentially specifying different output formats, key and value classes for each one.
> # Write to the additional outputs in a similar way as data is written to the OutputCollector.
> # Support the speculative execution semantics for the output files, only visible in the final output for promoted tasks.
> To support multiple outputs, the following classes would be added to mapred/lib:
> * {{MOJobConf}}: extends {{JobConf}}, adding methods to define named outputs (name, output format, key class, value class)
> * {{MOOutputCollector}}: extends {{OutputCollector}}, adding a {{collect(String outputName, WritableComparable key, Writable value)}} method.
> * {{MOMapper}} and {{MOReducer}}: implement {{Mapper}} and {{Reducer}}, adding new {{configure}}, {{map}} and {{reduce}} signatures that take the corresponding {{MO}} classes and perform the proper initialization.
> The data flow behavior would be: key/values written to the default (unnamed) output (using the original OutputCollector {{collect}} signature) take part in the shuffle/sort/reduce processing phases; key/values written to a named output from within a map do not.
> The named output files would be named using the task type and task ID to avoid collisions among tasks (i.e. 'new-m-00002' and 'new-r-00001').
> Together with the setInputPathFilter feature introduced by HADOOP-2055, it would become very easy to chain jobs working on particular named outputs within a single directory.
> We are using this pattern heavily and it has greatly simplified our M/R code as well as the chaining of different M/R jobs.
> We wanted to contribute this back to Hadoop as we think it is a generic feature many could benefit from.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
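As a rough illustration of the routing behavior and the task-type/task-ID file naming described in the quoted proposal, here is a minimal self-contained Java sketch. The interface and helper names below are illustrative stand-ins for the proposed (not yet committed) MOOutputCollector API, not actual Hadoop classes; the in-memory collector only demonstrates how records would fan out to named outputs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the multiple-outputs proposal in HADOOP-3149.
// None of these types are real Hadoop classes; they mirror the proposed
// collect(String outputName, key, value) signature with plain Strings
// standing in for WritableComparable/Writable.
public class MultipleOutputsSketch {

    // Mirrors the proposed MOOutputCollector: records sent to the default
    // (unnamed) output go through shuffle/sort/reduce; records sent to a
    // named output from within a map would not.
    interface MOOutputCollector<K, V> {
        void collect(K key, V value);                    // default (unnamed) output
        void collect(String outputName, K key, V value); // named output
    }

    // Named output files use the task type and task ID to avoid collisions
    // among tasks, e.g. 'new-m-00002' for a map task, 'new-r-00001' for a reduce.
    static String namedOutputFile(String outputName, boolean mapTask, int taskId) {
        return String.format("%s-%s-%05d", outputName, mapTask ? "m" : "r", taskId);
    }

    // Toy in-memory collector that routes each record to its named output.
    static class DemoCollector implements MOOutputCollector<String, String> {
        final Map<String, List<String>> outputs = new HashMap<>();

        public void collect(String key, String value) {
            collect("default", key, value);
        }

        public void collect(String outputName, String key, String value) {
            outputs.computeIfAbsent(outputName, n -> new ArrayList<>())
                   .add(key + "\t" + value);
        }
    }

    public static void main(String[] args) {
        DemoCollector out = new DemoCollector();
        out.collect("k1", "no-change");      // default output: shuffled/sorted
        out.collect("new", "k2", "created"); // named output: written directly
        out.collect("delete", "k3", "removed");

        System.out.println(namedOutputFile("new", true, 2));
        System.out.println(out.outputs.keySet());
    }
}
```

The naming helper shows why 'new-m-00002' and 'new-r-00001' cannot collide: the task type letter and zero-padded task ID are baked into every named output file.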