hadoop-common-dev mailing list archives

From "Tom White (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5303) Hadoop Workflow System (HWS)
Date Mon, 23 Feb 2009 23:56:02 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676133#action_12676133 ]

Tom White commented on HADOOP-5303:

It would be useful to clarify the goals a bit. For example, is the aim to be language independent,
so one can launch workflows from any programming language? Does this need to be a server-side
workflow scheduler, or would a client-side scheduler be sufficient (for the first release
at least)?

One of the stated goals is simplicity, so I wonder if there are some simpler approaches that
should be considered. For example:

* Can we use Ant? There are already some Ant tasks for interacting with Hadoop filesystems,
so would adding Ant tasks for submitting Hadoop and Pig jobs, retrieving counters, etc.,
provide enough control for running dependent jobs? Ant itself comes with an SSH task, an
HTTP GET task, and a mail task.
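Even without dedicated Hadoop job tasks, Ant's built-in exec task can already chain jobs through target dependencies. A minimal sketch of the idea (the jar, class, and path names are made up for illustration):

```xml
<project name="wordcount-flow" default="report">
  <!-- Each target is one step; Ant's depends attribute expresses the DAG. -->
  <target name="mr-job">
    <exec executable="hadoop" failonerror="true">
      <arg line="jar wordcount.jar WordCount in out"/>
    </exec>
  </target>
  <target name="report" depends="mr-job">
    <exec executable="hadoop" failonerror="true">
      <arg line="fs -cat out/part-00000"/>
    </exec>
  </target>
</project>
```

With failonerror set, a failed job stops the flow, which gives basic dependent-job control for free.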

* Another approach might be to extend JobControl. It has the notion of a Job (implemented
in org.apache.hadoop.mapred.jobcontrol.Job) that is currently tied to MapReduce jobs but
could be generalized to run Pig scripts, and perhaps other operations.
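To illustrate the generalization, here is a self-contained sketch. This is not Hadoop code: Task, NamedTask and SimpleWorkflow are hypothetical names standing in for a generalized jobcontrol.Job and JobControl, and a real Task.run() would submit a MapReduce job, a Pig script, or a filesystem operation.

```java
import java.util.*;

// A unit of work in the workflow: anything that can run once its
// dependencies have succeeded (MapReduce job, Pig script, fs op, ...).
interface Task {
    String name();
    List<Task> dependencies();
    void run() throws Exception;   // submit and wait for completion
}

// Trivial Task whose run() is a no-op; a real one would launch a job.
class NamedTask implements Task {
    private final String name;
    private final List<Task> deps;
    NamedTask(String name, Task... deps) {
        this.name = name;
        this.deps = Arrays.asList(deps);
    }
    public String name() { return name; }
    public List<Task> dependencies() { return deps; }
    public void run() { /* launch a Hadoop or Pig job here */ }
}

// Runs tasks in dependency order, like a generalized JobControl.
class SimpleWorkflow {
    static List<String> run(List<Task> tasks) throws Exception {
        List<String> order = new ArrayList<>();
        Set<Task> done = new HashSet<>();
        List<Task> pending = new ArrayList<>(tasks);
        while (!pending.isEmpty()) {
            boolean progressed = false;
            for (Iterator<Task> it = pending.iterator(); it.hasNext(); ) {
                Task t = it.next();
                if (done.containsAll(t.dependencies())) {
                    t.run();
                    done.add(t);
                    order.add(t.name());
                    it.remove();
                    progressed = true;
                }
            }
            // No runnable task left means a cycle or a missing dependency.
            if (!progressed) throw new IllegalStateException("cycle or missing dependency");
        }
        return order;
    }
}
```

The point is only that the scheduling logic is independent of what a task actually launches, which is why JobControl looks like a natural extension point.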

* Are there existing workflow engines that fulfill the requirements, or are at least close
enough for us to use or extend? (I notice one of the goals is to "Leverage existing expertise,
concepts and components whenever possible".) Has anyone evaluated what's out there? It would
be useful to do this exercise before implementing anything, I think.

Finally, a few comments on the spec itself:

* The "hadoop" action might be better called "map-reduce" since it runs a MapReduce job. "Hadoop"
is the name of the whole project.

* The "hdfs" action is not really confined to HDFS, but should be able to use any Hadoop filesystem,
such as KFS or S3, so it would be better to call it "fs". Also, there are other places in
the spec where HDFS can be generalized to be any Hadoop filesystem.

* Is there a way to query a workflow's progress to get a percentage complete? Would the details
or list operation do this?

> Hadoop Workflow System (HWS)
> ----------------------------
>                 Key: HADOOP-5303
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5303
>             Project: Hadoop Core
>          Issue Type: New Feature
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: hws-preso-v1_0_2009FEB22.pdf, hws-v1_0_2009FEB22.pdf
> This is a proposal for a system specialized in running Hadoop/Pig jobs in a control dependency
DAG (Directed Acyclic Graph), a Hadoop workflow application.
> Attached there is a complete specification and a high level overview presentation.
> ----
> *Highlights* 
> A Workflow application is a DAG that coordinates the following types of actions: Hadoop,
Pig, Ssh, Http, Email and sub-workflows. 
> Flow control operations within a workflow application can be done using decision,
fork and join nodes. Cycles in workflows are not supported.
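As a rough illustration of the fork/join flow control described above, a workflow definition might look something like the following. The element names here are guesses for illustration only; the actual XML schema is defined in the attached spec.

```xml
<workflow-app name="example-wf">
  <start to="split"/>
  <!-- fork runs both branches concurrently -->
  <fork name="split">
    <path start="mr-step"/>
    <path start="pig-step"/>
  </fork>
  <action name="mr-step">
    <map-reduce/>                 <!-- job details elided -->
    <ok to="merge"/>
    <error to="fail"/>
  </action>
  <action name="pig-step">
    <pig/>                        <!-- script details elided -->
    <ok to="merge"/>
    <error to="fail"/>
  </action>
  <!-- join waits for all forked branches before continuing -->
  <join name="merge" to="end"/>
  <kill name="fail"/>
  <end name="end"/>
</workflow-app>
```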
> Actions and decisions can be parameterized with job properties, action output (e.g.
Hadoop counters, Ssh key/value pairs output) and file information (file exists, file size,
etc.). Formal parameters are expressed in the workflow definition as {{${VAR}}} variables.
> A Workflow application is a ZIP file that contains the workflow definition (an XML file)
and all the files necessary to run the actions: JAR files for Map/Reduce jobs, shell scripts
for streaming Map/Reduce jobs, native libraries, Pig scripts, and other resource files.
> Before running a workflow job, the corresponding workflow application must be deployed
in HWS.
> Deploying workflow applications and running workflow jobs can be done via command line
tools, a WS API and a Java API.
> Monitoring the system and workflow jobs can be done via a web console, command line tools,
a WS API and a Java API.
> When submitting a workflow job, a set of properties resolving all the formal parameters
in the workflow definition must be provided. This set of properties is a Hadoop configuration.
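The {{${VAR}}} resolution described above could be implemented along these lines. This is only a sketch: ParamResolver is a hypothetical name, and the real spec may define different substitution rules.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Resolves ${VAR} references in workflow definition text against the
// property set supplied at job submission time.
class ParamResolver {
    private static final Pattern VAR =
        Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    static String resolve(String text, Properties props) {
        Matcher m = VAR.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = props.getProperty(m.group(1));
            // An unresolved formal parameter is a submission error.
            if (value == null)
                throw new IllegalArgumentException("unresolved parameter: " + m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Failing fast on an unresolved variable matches the requirement that the submitted properties resolve *all* formal parameters.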
> Possible states for a workflow job are: {{CREATED}}, {{RUNNING}}, {{SUSPENDED}}, {{SUCCEEDED}},
{{KILLED}} and {{FAILED}}.
> In the case of an action failure in a workflow job, depending on the type of failure,
HWS will attempt automatic retries, request a manual retry, or fail the workflow job.
> HWS can make HTTP callback notifications on action start/end/failure events and workflow
end/failure events.
> In the case of workflow job failure, the workflow job can be resubmitted, skipping previously
completed actions. Before resubmission, the workflow application can be updated with a patch
to fix a problem in the workflow application code.
> ----

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
