Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-dev@hadoop.apache.org
Message-ID: <933919151.1215314371991.JavaMail.jira@brutus>
Date: Sat, 5 Jul 2008 20:19:31 -0700 (PDT)
From: "Alejandro Abdelnur (JIRA)" <jira@apache.org>
To: core-dev@hadoop.apache.org
Subject: [jira] Issue Comment Edited: (HADOOP-3150) Move task file promotion
 into the task
In-Reply-To: <1005772078.1207083747584.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12610719#action_12610719 ] 

tucu00 edited comment on HADOOP-3150 at 7/5/08 8:18 PM:
--------------------------------------------------------------------

There are a few different topics being discussed in this issue:

# Moving the responsibility for committing the output of a task from the JT to the Task
# Making the committing of the output of a task generic, not HDFS-specific
# Being able to create side {{OutputStream}} s (not {{RecordWriters}} ) from a task

IMO this issue should only address the *first topic*. The gain is that the JT no longer performs the task output commit itself; it is left with just the coordination of it.
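
To make the split concrete, here is a rough sketch of what "the Task commits, the JT only coordinates" could look like. None of these names come from the patch: {{CommitCoordinator}}, {{mayCommit}} and {{TaskAttempt}} are hypothetical, and the coordinator stands in for whatever RPC the JT would actually expose.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the JT side: it only decides WHICH attempt may commit.
class CommitCoordinator {
    // taskId -> attemptId that was granted the right to commit
    private final ConcurrentMap<String, String> approved = new ConcurrentHashMap<>();

    // The first attempt asking for a given task wins; repeated calls by the winner stay true.
    boolean mayCommit(String taskId, String attemptId) {
        String winner = approved.putIfAbsent(taskId, attemptId);
        return winner == null || winner.equals(attemptId);
    }
}

// Hypothetical stand-in for the Task side: it performs the promotion itself.
class TaskAttempt {
    final String taskId;
    final String attemptId;

    TaskAttempt(String taskId, String attemptId) {
        this.taskId = taskId;
        this.attemptId = attemptId;
    }

    // Asks the coordinator for permission, then runs the promotion locally.
    // Returns false when another attempt (e.g. a speculative one) already won.
    boolean finish(CommitCoordinator jt, Runnable promoteOutput) {
        if (!jt.mayCommit(taskId, attemptId)) {
            return false;
        }
        promoteOutput.run();
        return true;
    }
}

class CommitCoordinationSketch {
    public static void main(String[] args) {
        CommitCoordinator jt = new CommitCoordinator();
        TaskAttempt first = new TaskAttempt("task_0001", "attempt_0001_0");
        TaskAttempt speculative = new TaskAttempt("task_0001", "attempt_0001_1");
        System.out.println(first.finish(jt, () -> System.out.println("promoting output")));
        System.out.println(speculative.finish(jt, () -> System.out.println("promoting output")));
    }
}
```

The point of the sketch is only that the file promotion work runs in the task process, while the JT keeps a small piece of state per task.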

The *third topic*, as has been suggested, could be addressed by HADOOP-3149, by adding a static method {{getOutputStream(JobConf conf, String baseName)}} to {{MultipleOutputs}}. This method would use the filename namespacing introduced by HADOOP-3149 (previously HADOOP-3258) to create a unique file under the job working output directory. Note that {{MultipleOutputs}} does not implement {{OutputFormat}}; because of this, IMO, we are not overloading it with unrelated behavior. {{MultipleOutputs}} just becomes a means to create additional outputs, {{OutputFormat}} s or {{OutputStream}} s, in the context of the output of a task, handled consistently with the regular task output in both the successful completion and the failure case.
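
A minimal sketch of the idea, using plain {{java.nio}} instead of the Hadoop {{FileSystem}} and {{JobConf}} classes so it runs standalone. The extra parameters ({{taskWorkDir}}, {{taskType}}, {{partition}}) are assumptions standing in for what the real method would read from the {{JobConf}}, and the {{baseName-m-00003}} naming pattern is only an illustration of the namespacing, not necessarily what HADOOP-3149 produces.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class SideOutputSketch {
    // Builds a per-task unique name so two tasks never write the same side file.
    // The naming pattern is an assumption made for this sketch.
    static Path sideFilePath(Path taskWorkDir, String baseName, String taskType, int partition) {
        String name = String.format("%s-%s-%05d", baseName, taskType, partition);
        return taskWorkDir.resolve(name);
    }

    // Hypothetical analogue of getOutputStream(JobConf conf, String baseName):
    // the stream is created under the task working directory, so it is promoted
    // or discarded together with the rest of the task output.
    static OutputStream getOutputStream(Path taskWorkDir, String baseName,
                                        String taskType, int partition) throws IOException {
        Files.createDirectories(taskWorkDir);
        Path file = sideFilePath(taskWorkDir, baseName, taskType, partition);
        // CREATE_NEW fails if the file already exists, which would indicate a name clash.
        return Files.newOutputStream(file, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
    }

    public static void main(String[] args) throws IOException {
        Path workDir = Files.createTempDirectory("task-work");
        try (OutputStream out = getOutputStream(workDir, "stats", "m", 3)) {
            out.write("side data\n".getBytes("UTF-8"));
        }
        System.out.println(sideFilePath(workDir, "stats", "m", 3));
    }
}
```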

The *second topic* is a whole topic of its own and I think it should be left to its own Jira:

# It should make the commit of a task output independent of HDFS
# It should handle the commit of a task output atomically (at least for each individual storage the outputs go to)
# It should not leave the commit to the {{OutputFormat}}, as jobs can use their own output formats. IMO it should be something like a {{TaskOutputCommitter}} for each storage type that is part of the Hadoop code (it cannot be set by a job) and is run once per storage instance used by the task (ideally in a transaction-like style).
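
To illustrate the shape I have in mind, here is a sketch of a per-storage committer. Only the name {{TaskOutputCommitter}} comes from the comment above; the {{commit}}/{{abort}} methods, the {{_temporary/<attempt>}} layout and the local-filesystem implementation are assumptions for illustration. Note that the moves here are atomic per file only, so this is weaker than the transaction-like commit described above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// One implementation per storage type, shipped with the framework rather than set by the job.
interface TaskOutputCommitter {
    void commit(String attemptId) throws IOException; // promote the attempt's output
    void abort(String attemptId) throws IOException;  // discard the attempt's output
}

// Hypothetical committer for a local filesystem, standing in for any storage type.
class LocalFsCommitter implements TaskOutputCommitter {
    private final Path outputDir;

    LocalFsCommitter(Path outputDir) {
        this.outputDir = outputDir;
    }

    // Where an attempt writes before it is committed (layout is an assumption).
    Path workDir(String attemptId) {
        return outputDir.resolve("_temporary").resolve(attemptId);
    }

    private List<Path> listFiles(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path f : stream) {
                files.add(f);
            }
        }
        return files;
    }

    @Override
    public void commit(String attemptId) throws IOException {
        Path work = workDir(attemptId);
        if (!Files.isDirectory(work)) {
            return; // the attempt produced no output on this storage
        }
        for (Path f : listFiles(work)) {
            Files.move(f, outputDir.resolve(f.getFileName()), StandardCopyOption.ATOMIC_MOVE);
        }
        Files.delete(work);
    }

    @Override
    public void abort(String attemptId) throws IOException {
        Path work = workDir(attemptId);
        if (!Files.isDirectory(work)) {
            return;
        }
        for (Path f : listFiles(work)) {
            Files.delete(f);
        }
        Files.delete(work);
    }
}
```

A task that writes to several storages would hold one committer instance per storage and call {{commit}} (or {{abort}}) on each of them exactly once when the attempt finishes.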


  
> Move task file promotion into the task
> --------------------------------------
>
>                 Key: HADOOP-3150
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3150
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.19.0
>
>         Attachments: 3150.patch
>
>
> We need to move the task file promotion from the JobTracker to the Task and move it down into the output format.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.