hadoop-common-dev mailing list archives

From "Philip Zeyliger (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6039) Computing Input Splits on the MR Cluster
Date Sun, 14 Jun 2009 22:59:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719350#action_12719350 ]

Philip Zeyliger commented on HADOOP-6039:
-----------------------------------------

The motivation behind computing the input splits on the cluster is at least two-fold:
 * It would be great to be able to submit jobs to a cluster using a simple (REST?) API, from
many languages.  (Similar to HADOOP-5633.)  The fact that job submission does a bunch of mapreduce-internal
work makes such submission very tricky.  We're already seeing how workflow systems (here I'm
thinking of Oozie and Pig) run MR jobs simply to launch more MR jobs, while inheriting the
scheduling and isolation work that the JobTracker already does.
 * Sometimes computing the input splits is, in and of itself, an operation that would do well
to be run in parallel across several machines.  For example, splitting inputs may require
going through many files on the DFS.  Moving input split calculations onto the cluster would
make this possible.
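
To illustrate the second point, here is a minimal sketch (not Hadoop code) of why split calculation parallelizes well: each file can be cut into byte ranges independently of every other file. The names FileRange, ParallelSplitSketch, splitFile and splitAll are hypothetical, and a Java parallel stream merely stands in for fanning the work out across machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for an input split: a byte range within one file.
final class FileRange {
    final String path;
    final long offset;
    final long length;

    FileRange(String path, long offset, long length) {
        this.path = path;
        this.offset = offset;
        this.length = length;
    }
}

class ParallelSplitSketch {
    // Cut one file of the given size into ranges of at most splitSize bytes.
    static List<FileRange> splitFile(String path, long fileSize, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<FileRange> ranges = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += splitSize) {
            ranges.add(new FileRange(path, offset, Math.min(splitSize, fileSize - offset)));
        }
        return ranges;
    }

    // Files are independent of one another, so the per-file work can run concurrently.
    // Assumption: the parallel stream is only a local stand-in for cluster-side execution.
    static List<FileRange> splitAll(Map<String, Long> fileSizes, long splitSize) {
        return fileSizes.entrySet().parallelStream()
                .flatMap(e -> splitFile(e.getKey(), e.getValue(), splitSize).stream())
                .collect(Collectors.toList());
    }
}
```

In a real job the expensive part would be listing and inspecting files on the DFS rather than the arithmetic, which is exactly the work that would benefit from running on the cluster.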

Implementation-wise, we already have JOB_SETUP and JOB_CLEANUP tasks, so adding a JOB_SPLIT_CALCULATION
task, which could be colocated with JOB_SETUP, makes some sense.
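
As a rough sketch of where such a task would sit, the following models the order of job phases with and without cluster-side split calculation. This is illustrative only: SketchTaskType and JobPlanSketch are hypothetical names rather than Hadoop's actual classes, and placing JOB_SPLIT_CALCULATION directly after JOB_SETUP is an assumption based on the colocation idea above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical task types; only JOB_SETUP and JOB_CLEANUP are named as existing in this
// thread, and JOB_SPLIT_CALCULATION is the proposed addition.
enum SketchTaskType { JOB_SETUP, JOB_SPLIT_CALCULATION, MAP, REDUCE, JOB_CLEANUP }

class JobPlanSketch {
    // Returns the phases a job would run through, in order.
    // When splitsOnCluster is false, splits are assumed to have been computed by the
    // submitting client before the job reaches the cluster.
    static List<SketchTaskType> plan(boolean splitsOnCluster) {
        List<SketchTaskType> phases = new ArrayList<>();
        phases.add(SketchTaskType.JOB_SETUP);
        if (splitsOnCluster) {
            // Assumed to run right after (or together with) setup, before any map task.
            phases.add(SketchTaskType.JOB_SPLIT_CALCULATION);
        }
        phases.add(SketchTaskType.MAP);
        phases.add(SketchTaskType.REDUCE);
        phases.add(SketchTaskType.JOB_CLEANUP);
        return phases;
    }
}
```

The point of the ordering is that map tasks cannot be scheduled until the splits exist, so the split task has to complete before the MAP phase starts.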

> Computing Input Splits on the MR Cluster
> ----------------------------------------
>
>                 Key: HADOOP-6039
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6039
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Philip Zeyliger
>
> Instead of computing the input splits as part of job submission, Hadoop could have a
> separate "job task type" that computes the input splits, therefore allowing that computation
> to happen on the cluster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

