From: Ross Gardler
To: general@incubator.apache.org
Subject: Re: [PROPOSAL] Oozie for the Apache Incubator
Date: Wed, 29 Jun 2011 11:22:39 +0100
References: <680911.36089.qm@web161309.mail.bf1.yahoo.com>

You might want to reconsider the name. In English (British English at
least) "ooze" is an unpleasant thing often related to a body wound or a
stagnant river.

The formal definition is not so bad [1], but in common (UK) usage it's
unpleasant.

Ross

[1] http://dictionary.reference.com/browse/ooze

On 29 June 2011 03:07, arvind@cloudera.com wrote:
> +1 (non-binding).
>
> Thanks,
> Arvind
>
> On Fri, Jun 24, 2011 at 12:46 PM, Mohammad Islam wrote:
>> Hi,
>>
>> I would like to propose Oozie to be an Apache Incubator project.
>> Oozie is a server-based workflow scheduling and coordination system to manage
>> data processing jobs for Apache Hadoop.
>>
>> Here's a link to the proposal in the Incubator wiki
>> http://wiki.apache.org/incubator/OozieProposal
>>
>> I've also pasted the initial contents below.
>>
>> Regards,
>>
>> Mohammad Islam
>>
>> Start of Oozie Proposal
>>
>> Abstract
>> Oozie is a server-based workflow scheduling and coordination system to manage
>> data processing jobs for Apache Hadoop™.
>>
>> Proposal
>> Oozie is an extensible, scalable and reliable system to define, manage,
>> schedule, and execute complex Hadoop workloads via web services. More
>> specifically, this includes:
>>
>>        * XML-based declarative framework to specify a job or a complex
>>          workflow of dependent jobs.
>>        * Support different types of job such as Hadoop Map-Reduce, Pipe,
>>          Streaming, Pig, Hive and custom java applications.
>>        * Workflow scheduling based on frequency and/or data availability.
>>        * Monitoring capability, automatic retry and failure handling of jobs.
>>        * Extensible and pluggable architecture to allow arbitrary grid
>>          programming paradigms.
>>        * Authentication, authorization, and capacity-aware load throttling
>>          to allow multi-tenant software as a service.
>>
>> Background
>> Most data processing applications require multiple jobs to achieve their goals,
>> with inherent dependencies among the jobs. A dependency could be sequential,
>> where one job can only start after another job has finished. Or it could be
>> conditional, where the execution of a job depends on the return value or status
>> of another job. In other cases, parallel execution of multiple jobs may be
>> permitted – or desired – to exploit the massive pool of compute nodes provided
>> by Hadoop.
>>
>> These job dependencies are often expressed as a Directed Acyclic Graph, also
>> called a workflow. A node in the workflow is typically a job (a computation on
>> the grid) or another type of action such as an email notification. Computations
>> can be expressed in map/reduce, Pig, Hive or any other programming paradigm
>> available on the grid. Edges of the graph represent transitions from one node
>> to the next, as the execution of a workflow proceeds.
>>
>> Describing a workflow in a declarative way has the advantage of decoupling job
>> dependencies and execution control from application logic. Furthermore, the
>> workflow is modularized into jobs that can be reused within the same workflow
>> or across different workflows. Execution of the workflow is then driven by a
>> runtime system that needs no understanding of the application logic of the jobs.
>> This runtime system specializes in reliable and predictable execution: it can
>> retry actions that have failed or invoke a cleanup action after termination of
>> the workflow; it can monitor progress, success, or failure of a workflow, and
>> send appropriate alerts to an administrator. The application developer is
>> relieved of implementing these generic procedures.
>>
>> Furthermore, some applications or workflows need to run at periodic intervals
>> or when dependent data is available. For example, a workflow could be executed
>> every day as soon as output data from the previous 24 instances of another,
>> hourly workflow is available. The workflow coordinator provides such scheduling
>> features, along with prioritization, load balancing and throttling to optimize
>> utilization of resources in the cluster. This makes it easier to maintain,
>> control, and coordinate complex data applications.
>>
>> Nearly three years ago, a team of Yahoo! developers addressed these critical
>> requirements for Hadoop-based data processing systems by developing a new
>> workflow management and scheduling system called Oozie. While it was initially
>> developed as a Yahoo!-internal project, it was designed and implemented with
>> the intention of open-sourcing. Oozie was released as a GitHub project in early
>> 2010. Oozie is used in production within Yahoo!, and since it was open-sourced
>> it has been gaining adoption among external developers.
>>
>> Rationale
>> Commonly, applications that run on Hadoop require multiple Hadoop jobs in order
>> to obtain the desired results. Furthermore, these Hadoop jobs are commonly a
>> combination of Java map-reduce jobs, Streaming map-reduce jobs, Pipes
>> map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs and shell
>> scripts.
>>
>> Because of this, developers find themselves writing ad-hoc glue programs to
>> combine these Hadoop jobs. These ad-hoc programs are difficult to schedule,
>> manage, monitor and recover.
>>
>> Workflow management and scheduling is an essential feature for large-scale data
>> processing applications. Each such application could write its own customized
>> solution, but that would incur separate development, operational, and
>> maintenance overhead. Because this is a prevalent use case for data processing,
>> the application developer would surely prefer a generalized solution with
>> little or no such overhead. Oozie addresses the challenge by providing an
>> execution framework to flexibly specify the job dependency, data dependency,
>> and time dependency. In addition, Oozie provides a multi-tenant-based
>> centralized service and the opportunity to optimize load and utilization while
>> respecting SLAs.
>>
>> Oozie is built on Apache Hadoop to schedule jobs related to various Apache
>> projects such as Hadoop, Pig, and Hive. As an Apache open source project, Oozie
>> is expected to attract the larger and more diversified community that currently
>> uses such Apache-sponsored projects. Additionally, users of the Hadoop
>> ecosystem can influence Oozie's roadmap, and contribute to it. Likewise, Oozie,
>> as part of the Apache Hadoop ecosystem, will be a great benefit to the current
>> Hadoop/Pig/Hive/HBase/HCatalog community.
>>
>> Current Status
>> Meritocracy
>> Oozie is currently a GitHub-based open source project where developers from
>> multiple companies are contributing to the project. Our intent with this
>> incubator proposal is to further extend this diverse developer community around
>> Oozie following the Apache meritocracy model.
>> We plan to continue to provide adequate support to new developers and to
>> quickly promote those who make solid contributions to committer status. In
>> addition, Oozie will expect, accept, and work to attract contributions from
>> amateurs as well.
>>
>> Community
>> While an efficient workflow management and scheduling system is critical for
>> large companies with large data processing workloads in multi-tenant clusters,
>> it is equally necessary for any non-trivial deployment. Different companies are
>> currently using Oozie as a workflow scheduler for Hadoop-based data processing.
>> At Yahoo! it is being used extensively in production clusters to process
>> thousands of jobs. Like the Oozie user community, the Oozie developer community
>> is also very strong. Developers from Yahoo! provided the initial codebase, and
>> they are still the most active contributors. In late 2010, developers from
>> Cloudera also started contributing, and currently other companies (e.g., IBM)
>> are beginning to participate.
>>
>> We currently use JIRA for issue tracking, GitHub for code hosting and Yahoo!
>> Groups for developer and user communications.
>>
>> Core Developers
>> Oozie is currently being designed and developed by four engineers from Yahoo! –
>> Mohammad Islam, Angelo Huang, Mayank Bansal, and Andreas Neumann. In addition,
>> many outside contributors are actively contributing to design and development.
>> Among them, Alejandro Abdelnur from Cloudera and Chao Wang from IBM are very
>> important contributors. All of these core developers have deep expertise in
>> Hadoop and the Hadoop ecosystem in general.
>>
>> Alignment
>> The ASF is a natural host for Oozie given that it is already the home of
>> Hadoop, Pig, Hive, and other emerging cloud software projects.
>> Oozie was designed to support Hadoop from the beginning in order to solve data
>> processing challenges in Hadoop clusters. Oozie complements the existing Apache
>> cloud computing projects by providing a flexible framework for managing complex
>> data processing tasks.
>>
>> Known Risks
>> Orphaned Products
>> The core developers plan to work full time on the project. There is very little
>> risk of Oozie getting orphaned since large companies like Yahoo! are
>> extensively using it on their production Hadoop clusters. For example, there
>> are nearly 400 Yahoo! internal Oozie users and thousands of jobs are processed
>> hourly through Oozie in production. In addition, there are nearly 400 active
>> users (including Yahoo! internal and external) in the email community where
>> nearly 15 emails are exchanged per day. Furthermore, there were more than 1500
>> downloads of the Oozie binary in the last eight months from the GitHub site,
>> and a large number of those downloads were made by other companies such as
>> Cloudera. Oozie has had three major releases and more than 15 patch releases in
>> the last couple of years, which further demonstrates that Oozie is a very
>> active project. We plan to extend and diversify this community further through
>> Apache.
>>
>> Inexperience with Open Source
>> The core developers are all active users and followers of open source. They are
>> already committers and contributors to the Oozie GitHub project. In addition,
>> they are very familiar with Apache principles and philosophy for
>> community-driven software development.
>>
>> Homogeneous Developers
>> The core developers are from Yahoo! as well as from several other corporations,
>> including Cloudera and IBM.
>>
>> Reliance on Salaried Developers
>> Currently, the developers are paid to do work on Oozie. Companies like Yahoo!
>> and Cloudera are invested in Oozie as the solution to the workflow management
>> and scheduling problem in Hadoop clusters, and that is not likely to change. In
>> addition, since workflow management is very important for most Hadoop-based
>> data processing, non-salaried developers and researchers from various
>> institutes are expected to contribute to the project.
>>
>> Relationships with Other Apache Products
>> Oozie is based on Apache Hadoop to manage jobs created by different Apache
>> projects such as Hadoop, Pig, and Hive. Users of these products are extensively
>> using Oozie as their workflow scheduler.
>>
>> An Excessive Fascination with the Apache Brand
>> We deeply respect the reputation of Apache and have had great success with
>> other Apache projects such as Pig and HCatalog. We are motivated to expand and
>> increase the adoption and development of Oozie following Apache's established
>> open source model. We have also given reasons in the Rationale and Alignment
>> sections.
>>
>> Documentation
>> Information about Oozie can be found at http://yahoo.github.com/oozie/. The
>> following links provide more information about Oozie in open source:
>>
>>        * Codebase at GitHub: https://github.com/yahoo/oozie.
>>        * JIRA: http://oozie-jira.hadoop.developer.yahoo.net
>>        * Continuous Integration (CI) build:
>>          http://oozie-ci.hadoop.developer.yahoo.net/
>>        * Yahoo user community: http://tech.groups.yahoo.com/group/Oozie-users/
>>
>> Initial Source
>> Oozie has been under development since 2009 by a team of engineers at Yahoo!. It
>> is currently hosted on GitHub under an Apache license at
>> https://github.com/yahoo/oozie.
>>
>> External Dependencies
>> The required external dependencies are all Apache License or compatible
>> licenses.
>> The components with non-Apache licenses are enumerated below:
>>
>>        * HSQLDB License: HSQLDB
>>        * JDOM license: JDOM
>>        * BSD: Serp
>>        * CDDL v1: jaxb-api, ejb, JAF
>>
>> NOTE: With the exception of HSQLDB and JDOM, which are used directly by Oozie,
>> the other listed components are transitive dependencies of other Apache
>> components used by Oozie.
>>
>> Cryptography
>> Oozie supports the Kerberos authentication mechanism to access secured Hadoop
>> services.
>>
>> Required Resources
>> Mailing Lists
>>        * oozie-private for private PMC discussions (with moderated subscriptions)
>>        * oozie-dev
>>        * oozie-commits
>>        * oozie-user
>> Subversion Directory
>> https://svn.apache.org/repos/asf/incubator/oozie
>> Issue Tracking
>> JIRA Oozie (OOZIE)
>> Other Resources
>> The existing code already has unit tests, so we would like a Hudson instance
>> to run them whenever a new patch is submitted. This can be added after project
>> creation.
>>
>> Initial Committers
>>        * Mohammad K Islam (mislam77 at yahoo dot com)
>>        * Angelo K Huang (angelohuang at gmail dot com)
>>        * Mayank Bansal (mabansal at gmail dot com)
>>        * Andreas Neumann (neunand at gmail dot com)
>>        * Alejandro Abdelnur (tucu00 at gmail dot com)
>>        * Chao Wang (brookwc at gmail dot com)
>> Affiliations
>>        * Mohammad K Islam (Yahoo!)
>>        * Angelo Huang (Yahoo!)
>>        * Mayank Bansal (Yahoo!)
>>        * Andreas Neumann (Yahoo!)
>>        * Alejandro Abdelnur (Cloudera)
>>        * Chao Wang (IBM)
>> Sponsors
>> Champion
>> Alan Gates
>> Nominated Mentors
>>        * Owen O'Malley (Incubator PMC member)
>>        * Alan Gates (Incubator PMC member)
>>        * Christopher Douglas (Incubator PMC member)
>>        * Devaraj Das (Hadoop PMC member)
>> Sponsoring Entity
>> We are requesting the Incubator to sponsor this project.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>

-- 
Ross Gardler (@rgardler)
Programme Leader (Open Development)
OpenDirective http://opendirective.com