hadoop-common-dev mailing list archives

From "Alejandro Abdelnur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5303) Hadoop Workflow System (HWS)
Date Mon, 23 Feb 2009 12:01:02 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675861#action_12675861 ]

Alejandro Abdelnur commented on HADOOP-5303:

Cascading and HWS are different beasts.

Cascading is a different way of doing what Pig does. Programming in Cascading is programming
at a higher level of abstraction that resolves into a series of Map/Reduce jobs.

HWS is a (server) workflow system specialized in running Hadoop/Pig jobs wired via a PDL descriptor.

Here are a few quick highlights of how Cascading and HWS differ:

h4. Cascading uses a topological search model to resolve the execution path.

HWS uses a 'DAG of processes workflow' model that allows explicitly expressing parallelism
and alternate execution paths (decisions).
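
As a rough illustration of the DAG model (not HWS code), the Python sketch below describes
a workflow with a fork and a join as plain data and checks that it contains no cycles before
deriving an execution order. The node names and the execution_order helper are hypothetical,
and it assumes Python 3.9+ for the standard graphlib module.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical workflow: each node maps to the nodes that must finish before it.
workflow = {
    "fork": {"start"},
    "pig-a": {"fork"},          # pig-a and mr-b can run in parallel after the fork
    "mr-b": {"fork"},
    "join": {"pig-a", "mr-b"},  # the join waits for both branches
    "end": {"join"},
}


def execution_order(dag):
    """Return one valid execution order, or raise ValueError if dag has a cycle."""
    try:
        return list(TopologicalSorter(dag).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"workflow is not a DAG: {exc.args[1]}") from exc


order = execution_order(workflow)
```

The relative order of the two parallel branches is not fixed; only the dependencies are
guaranteed. A definition containing a cycle is rejected with a ValueError.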

h4. Cascading runs as a client from the command line

HWS is a server system (like Hadoop Job Tracker) to which you submit workflow jobs and later
check the status.

In HWS no client resources are held once the client has submitted the workflow job; the workflow
job runs in the server.

This allows you to run several thousands of workflow jobs concurrently from a single HWS that
supports system failover.

In HWS monitoring and status tracking of jobs is done via CLIs and a web console that gathers
data from HWS (like you do in Hadoop).

h4. Cascading's primary programming model is similar to Pig's but with a Java API.

In Cascading you can still use your Hadoop jobs as a flow, as a way to integrate with existing
map/reduce apps, but the real benefit of Cascading comes from using its API programming model.

HWS's primary programming model is Hadoop/Pig jobs connected via a workflow definition, a
PDL-like XML file.

h4. In Cascading you need to write Java code to wire your Hadoop jobs

In HWS you don't have to wire your Hadoop/Pig jobs in Java; you wire them in a workflow XML
file, in a more declarative way.

> Hadoop Workflow System (HWS)
> ----------------------------
>                 Key: HADOOP-5303
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5303
>             Project: Hadoop Core
>          Issue Type: New Feature
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: hws-preso-v1_0_2009FEB22.pdf, hws-v1_0_2009FEB22.pdf
> This is a proposal for a system specialized in running Hadoop/Pig jobs in a control dependency
DAG (Directed Acyclic Graph), a Hadoop workflow application.
> Attached are a complete specification and a high-level overview presentation.
> ----
> *Highlights* 
> A Workflow application is a DAG that coordinates the following types of actions: Hadoop,
Pig, Ssh, Http, Email and sub-workflows. 
> Flow control operations within the workflow applications can be done using decision,
fork and join nodes. Cycles in workflows are not supported.
> Actions and decisions can be parameterized with job properties, action output (e.g.
Hadoop counters, Ssh key/value pairs output) and file information (file exists, file size,
etc). Formal parameters are expressed in the workflow definition as {{${VAR}}} variables.
> A Workflow application is a ZIP file that contains the workflow definition (an XML file) and
all the files necessary to run the actions: JAR files for Map/Reduce jobs, shells for
streaming Map/Reduce jobs, native libraries, Pig scripts, and other resource files.
> Before running a workflow job, the corresponding workflow application must be deployed
in HWS.
> Deploying workflow applications and running workflow jobs can be done via command line
tools, a WS API and a Java API.
> Monitoring the system and workflow jobs can be done via a web console, command line tools,
a WS API and a Java API.
> When submitting a workflow job, a set of properties resolving all the formal parameters
in the workflow definitions must be provided. This set of properties is a Hadoop configuration.
> Possible states for a workflow job are: {{CREATED}}, {{RUNNING}}, {{SUSPENDED}}, {{SUCCEEDED}},
{{KILLED}} and {{FAILED}}.
> In the case of an action failure in a workflow job, depending on the type of failure,
HWS will attempt automatic retries, request a manual retry, or fail the workflow.
> HWS can make HTTP callback notifications on action start/end/failure events and workflow
end/failure events.
> In the case of workflow job failure, the workflow job can be resubmitted, skipping previously
completed actions. Before resubmission, the workflow application can be updated with
a patch to fix a problem in the workflow application code.
> ----
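
To illustrate the formal-parameter highlight above, here is a minimal Python sketch (not HWS
code) of resolving {{${VAR}}} variables against the properties supplied at submission time.
The resolve function, the property names and the accepted variable-name pattern are all
assumptions made for illustration. Because the description says every formal parameter must
be provided, the sketch fails on any unresolved variable instead of leaving it in place.

```python
import re

# Assumed syntax: ${name}, where name consists of letters, digits, '_' or '.'.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_.]*)\}")


def resolve(text, properties):
    """Replace every ${VAR} in text with its value; fail if one is missing."""
    def lookup(match):
        name = match.group(1)
        if name not in properties:
            raise KeyError(f"unresolved workflow parameter: {name}")
        return str(properties[name])
    return _VAR.sub(lookup, text)


# Hypothetical job properties supplied when submitting a workflow job.
job_properties = {"inputDir": "/data/in", "outputDir": "/data/out"}
resolved = resolve("input=${inputDir} output=${outputDir}", job_properties)
```

Here resolved is "input=/data/in output=/data/out"; omitting either property raises a KeyError
naming the missing parameter.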

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
