From: Tom White
Date: Fri, 1 Jul 2011 09:18:04 +0100
Subject: Re: [VOTE] Oozie to join the Incubator
To: general@incubator.apache.org

+1

Tom

On Wed, Jun 29, 2011 at 8:10 PM, Mohammad Islam wrote:
> Hi All,
>
> The discussion about the Oozie proposal is settling down.
> Therefore I would like to initiate a vote to accept Oozie as an Apache
> Incubator project.
>
> The latest proposal is pasted at the end and can also be found in the wiki:
>
> http://wiki.apache.org/incubator/OozieProposal
>
> The related discussion thread is at:
> http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
>
> Please cast your votes:
>
> [  ] +1 Accept Oozie for incubation
> [  ] +0 Indifferent to Oozie incubation
> [  ] -1 Reject Oozie for incubation
>
> This vote will close 72 hours from now.
>
> Regards,
> Mohammad
>
>
> Abstract
> Oozie is a server-based workflow scheduling and coordination system to
> manage data processing jobs for Apache Hadoop™.
>
> Proposal
> Oozie is an extensible, scalable and reliable system to define, manage,
> schedule, and execute complex Hadoop workloads via web services. More
> specifically, this includes:
>
>     * XML-based declarative framework to specify a job or a complex
>       workflow of dependent jobs.
>     * Support for different types of jobs such as Hadoop Map-Reduce, Pipes,
>       Streaming, Pig, Hive and custom Java applications.
>     * Workflow scheduling based on frequency and/or data availability.
>     * Monitoring capability, automatic retry and failure handling of jobs.
>     * Extensible and pluggable architecture to allow arbitrary grid
>       programming paradigms.
>     * Authentication, authorization, and capacity-aware load throttling to
>       allow multi-tenant software as a service.
>
> Background
> Most data processing applications require multiple jobs to achieve their
> goals, with inherent dependencies among the jobs. A dependency could be
> sequential, where one job can only start after another job has finished.
> Or it could be conditional, where the execution of a job depends on the
> return value or status of another job.
> In other cases, parallel execution of multiple jobs may be permitted - or
> desired - to exploit the massive pool of compute nodes provided by Hadoop.
>
> These job dependencies are often expressed as a Directed Acyclic Graph,
> also called a workflow. A node in the workflow is typically a job (a
> computation on the grid) or another type of action such as an email
> notification. Computations can be expressed in map/reduce, Pig, Hive or
> any other programming paradigm available on the grid. Edges of the graph
> represent transitions from one node to the next, as the execution of a
> workflow proceeds.
>
> Describing a workflow in a declarative way has the advantage of decoupling
> job dependencies and execution control from application logic.
> Furthermore, the workflow is modularized into jobs that can be reused
> within the same workflow or across different workflows. Execution of the
> workflow is then driven by a runtime system without understanding the
> application logic of the jobs. This runtime system specializes in reliable
> and predictable execution: it can retry actions that have failed or invoke
> a cleanup action after termination of the workflow; it can monitor
> progress, success, or failure of a workflow, and send appropriate alerts
> to an administrator. The application developer is relieved from
> implementing these generic procedures.
>
> Furthermore, some applications or workflows need to run at periodic
> intervals or when dependent data is available. For example, a workflow
> could be executed every day as soon as output data from the previous 24
> instances of another, hourly workflow is available. The workflow
> coordinator provides such scheduling features, along with prioritization,
> load balancing and throttling to optimize utilization of resources in the
> cluster.
> This makes it easier to maintain, control, and coordinate complex data
> applications.
>
> Nearly three years ago, a team of Yahoo! developers addressed these
> critical requirements for Hadoop-based data processing systems by
> developing a new workflow management and scheduling system called Oozie.
> While it was initially developed as a Yahoo!-internal project, it was
> designed and implemented with the intention of open-sourcing. Oozie was
> released as a GitHub project in early 2010. Oozie is used in production
> within Yahoo! and, since it was open-sourced, it has been gaining adoption
> with external developers.
>
> Rationale
> Commonly, applications that run on Hadoop require multiple Hadoop jobs in
> order to obtain the desired results. Furthermore, these Hadoop jobs are
> commonly a combination of Java map-reduce jobs, Streaming map-reduce jobs,
> Pipes map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs
> and shell scripts.
>
> Because of this, developers find themselves writing ad-hoc glue programs
> to combine these Hadoop jobs. These ad-hoc programs are difficult to
> schedule, manage, monitor and recover.
>
> Workflow management and scheduling is an essential feature for large-scale
> data processing applications. Such applications could write a customized
> solution, but that would require separate development, operational, and
> maintenance overhead. Since it is a prevalent use-case for data
> processing, the application developer would surely prefer a generalized
> solution with little or no such overhead. Oozie addresses the challenge by
> providing an execution framework to flexibly specify the job dependency,
> data dependency, and time dependency. In addition, Oozie provides a
> multi-tenant-based centralized service and the opportunity to optimize
> load and utilization while respecting SLAs.
>
> Oozie is built on Apache Hadoop™ to schedule jobs related to various
> Apache projects such as Hadoop, Pig, and Hive. As an Apache open source
> project, Oozie is expected to attract the larger and more diversified
> community that currently uses such Apache sponsored projects.
> Additionally, users of the Hadoop ecosystem can influence Oozie's roadmap,
> and contribute to it. Likewise, Oozie, as part of the Apache Hadoop™
> ecosystem, will be a great benefit to the current
> Hadoop/Pig/Hive/HBase/HCatalog community.
>
> Current Status
> Meritocracy
> Oozie currently is a GitHub-based open source project where developers
> from multiple companies are contributing to the project. Our intent with
> this incubator proposal is to further extend this diverse developer
> community around Oozie following the Apache meritocracy model. We plan to
> continue to provide adequate support to new developers and to quickly
> recruit those who make solid contributions to committer status. In
> addition, Oozie will expect, accept, and work to attract contributions
> from amateurs as well.
>
> Community
> While an efficient workflow management and scheduling system is critical
> for large companies with huge data processing in multi-tenant clusters, it
> is equally necessary for any non-trivial deployment. Different companies
> are currently using Oozie as a workflow scheduler for Hadoop-based data
> processing. At Yahoo! it is being used extensively in production clusters
> to process thousands of jobs. Like the Oozie user community, the Oozie
> developer community is also very strong. Developers from Yahoo! provided
> the initial code base, and they are still the most active contributors. In
> late 2010, developers from Cloudera also started contributing, and
> currently other companies (e.g., IBM) are beginning to participate.
>
> We currently use JIRA for issue tracking, GitHub for code hosting and
> Yahoo! Groups for developer and user communications.
>
> Core Developers
> Oozie is currently being designed and developed by four engineers from
> Yahoo! - Mohammad Islam, Angelo Huang, Mayank Bansal, and Andreas Neumann.
> In addition, many outside contributors are actively contributing in design
> and development. Among them, Alejandro Abdelnur from Cloudera and Chao
> Wang from IBM are very important contributors. All of these core
> developers have deep expertise in Hadoop and the Hadoop ecosystem in
> general.
>
> Alignment
> The ASF is a natural host for Oozie given that it is already the home of
> Hadoop, Pig, Hive, and other emerging cloud software projects. Oozie was
> designed to support Hadoop from the beginning in order to solve data
> processing challenges in Hadoop clusters. Oozie complements the existing
> Apache cloud computing projects by providing a flexible framework for
> managing complex data processing tasks.
>
> Known Risks
> Orphaned Products
> The core developers plan to work full time on the project. There is very
> little risk of Oozie getting orphaned since large companies like Yahoo!
> are extensively using it on their production Hadoop clusters. For example,
> there are nearly 400 Yahoo! internal Oozie users and thousands of jobs are
> processed hourly through Oozie in production. In addition, there are
> nearly 400 active users (including Yahoo! internal and external) in the
> email community where nearly 15 emails are exchanged per day. Furthermore,
> there were more than 1500 downloads of the Oozie binary in the last eight
> months from the GitHub site and a large number of downloads were conducted
> by other companies such as Cloudera.
> Oozie has had three major releases and more than 15 patch releases in the
> last couple of years, which further demonstrates that Oozie is a very
> active project. We plan to extend and diversify this community further
> through Apache.
>
> Inexperience with Open Source
> The core developers are all active users and followers of open source.
> They are already committers and contributors to the Oozie GitHub project.
> In addition, they are very familiar with Apache principles and philosophy
> for community driven software development.
>
> Homogeneous Developers
> The core developers are from Yahoo! as well as from several other
> corporations, including Cloudera and IBM.
>
> Reliance on Salaried Developers
> Currently, the developers are paid to do work on Oozie. Companies like
> Yahoo! and Cloudera are invested in Oozie as the solution to the workflow
> management and scheduling problem in Hadoop clusters, and that is not
> likely to change. In addition, since workflow management is very important
> for most Hadoop-based data processing, non-salaried developers and
> researchers from various institutes are expected to contribute to the
> project.
>
> Relationships with Other Apache Products
> Oozie is based on Apache Hadoop to manage jobs created by different Apache
> projects such as Hadoop, Pig, and Hive. Users of these products are
> extensively using Oozie as their workflow scheduler.
>
> An Excessive Fascination with the Apache Brand
> We deeply respect the reputation of Apache and have had great success with
> other Apache projects such as Pig and HCatalog. We are motivated to expand
> and increase the adoption and development of Oozie following Apache's
> established open source model. We have also given reasons in the Rationale
> and Alignment sections.
>
> Documentation
> Information about Oozie can be found at http://yahoo.github.com/oozie/.
> The following links provide more information about Oozie in open source:
>
>     * Codebase at GitHub: https://github.com/yahoo/oozie
>     * JIRA: http://oozie-jira.hadoop.developer.yahoo.net
>     * Continuous Integration (CI) build:
>       http://oozie-ci.hadoop.developer.yahoo.net/
>     * Yahoo! user community:
>       http://tech.groups.yahoo.com/group/Oozie-users/
>
> Initial Source
> Oozie has been under development since 2009 by a team of engineers at
> Yahoo!. It is currently hosted on GitHub under an Apache license at
> https://github.com/yahoo/oozie.
>
> External Dependencies
> The required external dependencies are all under the Apache License or
> compatible licenses. The components with non-Apache licenses are
> enumerated below:
>
>     * HSQLDB License: HSQLDB
>     * JDOM license: JDOM
>     * BSD: Serp
>     * CDDL v1: jaxb-api, ejb, JAF
>
> NOTE: With the exception of HSQLDB and JDOM, which are directly used by
> Oozie, the other listed components are transitive dependencies of other
> Apache components used by Oozie.
>
> Cryptography
> Oozie supports the Kerberos authentication mechanism to access secured
> Hadoop services.
>
> Required Resources
> Mailing Lists
>     * oozie-private for private PMC discussions (with moderated
>       subscriptions)
>     * oozie-dev
>     * oozie-commits
>     * oozie-user
> Subversion Directory
> https://svn.apache.org/repos/asf/incubator/oozie
> Issue Tracking
> JIRA Oozie (OOZIE)
> Other Resources
> The existing code already has unit tests, so we would like a Hudson
> instance to run them whenever a new patch is submitted. This can be added
> after project creation.
>
> Initial Committers
>     * Mohammad K Islam (mislam77 at yahoo dot com)
>     * Angelo K Huang (angelohuang at gmail dot com)
>     * Mayank Bansal (mabansal at gmail dot com)
>     * Andreas Neumann (neunand at gmail dot com)
>     * Alejandro Abdelnur (tucu00 at gmail dot com)
>     * Chao Wang (brookwc at gmail dot com)
> Affiliations
>     * Mohammad K Islam (Yahoo!)
>     * Angelo Huang (Yahoo!)
>     * Mayank Bansal (Yahoo!)
>     * Andreas Neumann (Yahoo!)
>     * Alejandro Abdelnur (Cloudera)
>     * Chao Wang (IBM)
> Sponsors
> Champion
> Alan Gates
> Nominated Mentors
>     * Owen O'Malley (Incubator PMC member)
>     * Alan Gates (Incubator PMC member)
>     * Christopher Douglas (Incubator PMC member)
>     * Devaraj Das (Hadoop PMC member)
> Sponsoring Entity
> We are requesting the Incubator to sponsor this project.