hadoop-common-dev mailing list archives

From "Alejandro Abdelnur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3150) Move task file promotion into the task
Date Sun, 06 Jul 2008 03:16:02 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12610719#action_12610719 ]

Alejandro Abdelnur commented on HADOOP-3150:
--------------------------------------------

There are a few different topics being discussed in this issue:

# Moving the responsibility for committing the output of a task from the JT to the Task
# Making the committing of the output of a task generic, not HDFS-specific
# Being able to create side OutputStreams (not RecordWriters) from a task

IMO this issue should address only the *first topic*. The gain is that the JT no longer
performs the task output commit itself and is left with just the coordination of it.

As has been suggested, the *third topic* could be addressed by HADOOP-3149, by adding
a static method {{getOutputStream(JobConf conf, String baseName)}}. This method would use
the filename namespacing introduced by HADOOP-3149 (previously HADOOP-3258) to create a unique
file under the job working output directory. Note that {{MultipleOutputs}} does not implement
{{OutputFormat}}, so IMO we are not overloading it with unrelated behavior;
{{MultipleOutputs}} just becomes a means of creating additional outputs, either {{OutputFormat}}s
or {{OutputStream}}s, in the context of the output of a task, handled consistently with the
rest of the task output on both successful completion and failure.
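
To make the idea concrete, here is a minimal sketch of such a side-stream helper. It is not the HADOOP-3149 code: {{SideOutputs}}, {{uniqueName}} and the {{attemptWorkDir}} and {{taskAttemptId}} parameters are hypothetical, the local filesystem stands in for the job working output directory, and explicit parameters replace the {{JobConf}} so the example is self-contained.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch only: none of these names are Hadoop APIs.
final class SideOutputs {
    private SideOutputs() {}

    // Namespaces the base name with the task attempt id so that two
    // attempts never write to the same file (assumed naming scheme).
    static String uniqueName(String baseName, String taskAttemptId) {
        return baseName + "-" + taskAttemptId;
    }

    // Opens a raw stream (not a RecordWriter) under the attempt's working
    // directory, so the file is promoted or discarded with the rest of
    // the task output.
    static OutputStream getOutputStream(Path attemptWorkDir, String baseName,
                                        String taskAttemptId) throws IOException {
        Files.createDirectories(attemptWorkDir);
        Path file = attemptWorkDir.resolve(uniqueName(baseName, taskAttemptId));
        return Files.newOutputStream(file);
    }
}
```

Because the stream lives in the same working directory as the regular task output, it needs no commit logic of its own.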

The *second topic* is a whole topic on its own, and I think it should be left to its own Jira:

# It should make the commit of a task output independent of HDFS
# It should handle the commit of a task output atomically (at least against each individual
storage the outputs go to)
# It should not leave the commit to the {{OutputFormat}}, as jobs can use their own output
formats. IMO it should be something like a {{TaskOutputCommitter}} for each storage type that
is part of the Hadoop code (cannot be set by a job) and is run once per storage instance used
by the task (ideally in a transaction-like style).
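
A rough sketch of what the three points above could look like follows. Only the name {{TaskOutputCommitter}} comes from this comment; its methods, {{LocalDirCommitter}} and {{TaskCommit}} are assumptions made for illustration, and a local directory rename stands in for a real storage type.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical interface: one implementation per storage type,
// supplied by the framework rather than by the job.
interface TaskOutputCommitter {
    void commit() throws IOException;
    void abort() throws IOException;
}

// Assumed example storage type: a local directory promoted with one rename.
final class LocalDirCommitter implements TaskOutputCommitter {
    private final Path workDir;
    private final Path finalDir;

    LocalDirCommitter(Path workDir, Path finalDir) {
        this.workDir = workDir;
        this.finalDir = finalDir;
    }

    @Override
    public void commit() throws IOException {
        // A single atomic move: either all files are promoted or none are.
        Files.move(workDir, finalDir, StandardCopyOption.ATOMIC_MOVE);
    }

    @Override
    public void abort() throws IOException {
        if (!Files.exists(workDir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(workDir)) {
            // Reverse order lists children before their parent directory.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}

final class TaskCommit {
    private TaskCommit() {}

    // Runs each committer once, one per storage instance used by the task.
    // On failure the remaining outputs are aborted; outputs that were already
    // committed are not rolled back, so this is atomic per storage only.
    static void commitAll(List<TaskOutputCommitter> committers) throws IOException {
        for (int i = 0; i < committers.size(); i++) {
            try {
                committers.get(i).commit();
            } catch (IOException e) {
                for (int j = i; j < committers.size(); j++) {
                    committers.get(j).abort();
                }
                throw e;
            }
        }
    }
}
```

A true cross-storage transaction would need more than this (for example a two-phase protocol), which is part of why the topic deserves its own Jira.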


> Move task file promotion into the task
> --------------------------------------
>
>                 Key: HADOOP-3150
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3150
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.19.0
>
>         Attachments: 3150.patch
>
>
> We need to move the task file promotion from the JobTracker to the Task and move it down
> into the output format.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

