From: Suresh Marru
To: general@incubator.apache.org
Reply-To: general@incubator.apache.org
Subject: Re: [VOTE] Oozie to join the Incubator
Date: Wed, 29 Jun 2011 15:14:22 -0400
In-Reply-To: <1309374626.44151.YahooMailRC@web161315.mail.bf1.yahoo.com>

Hi Mohammad,

I am interested in contributing to this project. Since no one has voted
yet, can I add my name to the Initial Committers?

Thanks,
Suresh

On Jun 29, 2011, at 3:10 PM, Mohammad Islam wrote:

> Hi All,
>
> The discussion about the Oozie proposal is settling down.
> Therefore I would like to initiate a vote to accept Oozie as an Apache
> Incubator project.
>
> The latest proposal is pasted at the end and can also be found on the
> wiki:
>
> http://wiki.apache.org/incubator/OozieProposal
>
> The related discussion thread is at:
> http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
>
> Please cast your votes:
>
> [ ] +1 Accept Oozie for incubation
> [ ] +0 Indifferent to Oozie incubation
> [ ] -1 Reject Oozie for incubation
>
> This vote will close 72 hours from now.
>
> Regards,
> Mohammad
>
>
> Abstract
> Oozie is a server-based workflow scheduling and coordination system to
> manage data processing jobs for Apache Hadoop™.
>
> Proposal
> Oozie is an extensible, scalable and reliable system to define, manage,
> schedule, and execute complex Hadoop workloads via web services. More
> specifically, this includes:
>
> * XML-based declarative framework to specify a job or a complex
>   workflow of dependent jobs.
> * Support for different types of jobs such as Hadoop Map-Reduce, Pipes,
>   Streaming, Pig, Hive and custom Java applications.
> * Workflow scheduling based on frequency and/or data availability.
> * Monitoring capability, automatic retry and failure handling of jobs.
> * Extensible and pluggable architecture to allow arbitrary grid
>   programming paradigms.
> * Authentication, authorization, and capacity-aware load throttling to
>   allow multi-tenant software as a service.
>
> Background
> Most data processing applications require multiple jobs to achieve
> their goals, with inherent dependencies among the jobs. A dependency
> could be sequential, where one job can only start after another job has
> finished. Or it could be conditional, where the execution of a job
> depends on the return value or status of another job.
> In other cases, parallel execution of multiple jobs may be permitted
> - or desired - to exploit the massive pool of compute nodes provided by
> Hadoop.
>
> These job dependencies are often expressed as a Directed Acyclic Graph,
> also called a workflow. A node in the workflow is typically a job (a
> computation on the grid) or another type of action such as an email
> notification. Computations can be expressed in map/reduce, Pig, Hive or
> any other programming paradigm available on the grid. Edges of the
> graph represent transitions from one node to the next, as the execution
> of a workflow proceeds.
>
> Describing a workflow in a declarative way has the advantage of
> decoupling job dependencies and execution control from application
> logic. Furthermore, the workflow is modularized into jobs that can be
> reused within the same workflow or across different workflows.
> Execution of the workflow is then driven by a runtime system without
> understanding the application logic of the jobs. This runtime system
> specializes in reliable and predictable execution: It can retry actions
> that have failed or invoke a cleanup action after termination of the
> workflow; it can monitor progress, success, or failure of a workflow,
> and send appropriate alerts to an administrator. The application
> developer is relieved from implementing these generic procedures.
>
> Furthermore, some applications or workflows need to run in periodic
> intervals or when dependent data is available. For example, a workflow
> could be executed every day as soon as output data from the previous 24
> instances of another, hourly workflow is available. The workflow
> coordinator provides such scheduling features, along with
> prioritization, load balancing and throttling to optimize utilization
> of resources in the cluster.
> This makes it easier to maintain, control, and coordinate complex data
> applications.
>
> Nearly three years ago, a team of Yahoo! developers addressed these
> critical requirements for Hadoop-based data processing systems by
> developing a new workflow management and scheduling system called
> Oozie. While it was initially developed as a Yahoo!-internal project,
> it was designed and implemented with the intention of open-sourcing.
> Oozie was released as a GitHub project in early 2010. Oozie is used in
> production within Yahoo!, and since it was open-sourced it has been
> gaining adoption with external developers.
>
> Rationale
> Commonly, applications that run on Hadoop require multiple Hadoop jobs
> in order to obtain the desired results. Furthermore, these Hadoop jobs
> are commonly a combination of Java map-reduce jobs, Streaming
> map-reduce jobs, Pipes map-reduce jobs, Pig jobs, Hive jobs, HDFS
> operations, Java programs and shell scripts.
>
> Because of this, developers find themselves writing ad-hoc glue
> programs to combine these Hadoop jobs. These ad-hoc programs are
> difficult to schedule, manage, monitor and recover.
>
> Workflow management and scheduling is an essential feature for
> large-scale data processing applications. Such applications could each
> write a customized solution, but that would require separate
> development, operational, and maintenance overhead. Since it is a
> prevalent use-case for data processing, the application developer would
> surely prefer a generalized solution with little or no such overhead.
> Oozie addresses the challenge by providing an execution framework to
> flexibly specify the job dependency, data dependency, and time
> dependency.
> In addition, Oozie provides a multi-tenant-based centralized service
> and the opportunity to optimize load and utilization while respecting
> SLAs.
>
> Oozie is built on Apache Hadoop™ to schedule jobs related to various
> Apache projects such as Hadoop, Pig, and Hive. As an Apache open source
> project, Oozie is expected to attract the larger and more diversified
> community that currently uses such Apache-sponsored projects.
> Additionally, users of the Hadoop ecosystem can influence Oozie's
> roadmap, and contribute to it. Likewise, Oozie, as part of the Apache
> Hadoop™ ecosystem, will be a great benefit to the current
> Hadoop/Pig/Hive/HBase/HCatalog community.
>
> Current Status
> Meritocracy
> Oozie is currently a GitHub-based open source project where developers
> from multiple companies are contributing to the project. Our intent
> with this incubator proposal is to further extend this diverse
> developer community around Oozie following the Apache meritocracy
> model. We plan to continue to provide adequate support to new
> developers and to quickly recruit those who make solid contributions to
> committer status. In addition, Oozie will expect, accept, and work to
> attract contributions from amateurs as well.
>
> Community
> While an efficient workflow management and scheduling system is
> critical for large companies with huge data processing in multi-tenant
> clusters, it is equally necessary for any non-trivial deployment.
> Different companies are currently using Oozie as a workflow scheduler
> for Hadoop-based data processing. At Yahoo! it is being used
> extensively in production clusters to process thousands of jobs. Like
> the Oozie user community, the Oozie developer community is also very
> strong. Developers from Yahoo! provided the initial code base, and they
> are still the most active contributors.
> In late 2010, developers from Cloudera also started contributing, and
> currently other companies (e.g., IBM) are beginning to participate.
>
> We currently use JIRA for issue tracking, GitHub for code hosting and
> Yahoo! Groups for developer and user communications.
>
> Core Developers
> Oozie is currently being designed and developed by four engineers from
> Yahoo!: Mohammad Islam, Angelo Huang, Mayank Bansal, and Andreas
> Neumann. In addition, many outside contributors are actively
> contributing in design and development. Among them, Alejandro Abdelnur
> from Cloudera and Chao Wang from IBM are very important contributors.
> All of these core developers have deep expertise in Hadoop and the
> Hadoop Ecosystem in general.
>
> Alignment
> The ASF is a natural host for Oozie given that it is already the home
> of Hadoop, Pig, Hive, and other emerging cloud software projects. Oozie
> was designed to support Hadoop from the beginning in order to solve
> data processing challenges in Hadoop clusters. Oozie complements the
> existing Apache cloud computing projects by providing a flexible
> framework for managing complex data processing tasks.
>
> Known Risks
> Orphaned Products
> The core developers plan to work full time on the project. There is
> very little risk of Oozie getting orphaned since large companies like
> Yahoo! are extensively using it on their production Hadoop clusters.
> For example, there are nearly 400 Yahoo! internal Oozie users and
> thousands of jobs are processed hourly through Oozie in production. In
> addition, there are nearly 400 active users (including Yahoo! internal
> and external) in the email community where nearly 15 emails are
> exchanged per day.
> Furthermore, there were more than 1500 downloads of the Oozie binary in
> the last eight months from the GitHub site, and a large number of those
> downloads were made by other companies such as Cloudera. Oozie has had
> three major releases and more than 15 patch releases in the last couple
> of years, which further demonstrates that Oozie is a very active
> project. We plan to extend and diversify this community further through
> Apache.
>
> Inexperience with Open Source
> The core developers are all active users and followers of open source.
> They are already committers and contributors to the Oozie GitHub
> project. In addition, they are very familiar with Apache principles and
> philosophy for community-driven software development.
>
> Homogeneous Developers
> The core developers are from Yahoo! as well as from several other
> corporations, including Cloudera and IBM.
>
> Reliance on Salaried Developers
> Currently, the developers are paid to do work on Oozie. Companies like
> Yahoo! and Cloudera are invested in Oozie as the solution to the
> workflow management and scheduling problem in Hadoop clusters, and that
> is not likely to change. In addition, since workflow management is very
> important for most Hadoop-based data processing, non-salaried
> developers and researchers from various institutes are expected to
> contribute to the project.
>
> Relationships with Other Apache Products
> Oozie is based on Apache Hadoop to manage jobs created by different
> Apache projects such as Hadoop, Pig, and Hive. Users of these products
> are extensively using Oozie as their workflow scheduler.
>
> An Excessive Fascination with the Apache Brand
> We deeply respect the reputation of Apache and have had great success
> with other Apache projects such as Pig and HCatalog.
> We are motivated to expand and increase the adoption and development of
> Oozie following Apache's established open source model. We have also
> given reasons in the Rationale and Alignment sections.
>
> Documentation
> Information about Oozie can be found at http://yahoo.github.com/oozie/.
> The following links provide more information about Oozie in open
> source:
>
> * Codebase at GitHub: https://github.com/yahoo/oozie
> * JIRA: http://oozie-jira.hadoop.developer.yahoo.net
> * Continuous Integration (CI) build:
>   http://oozie-ci.hadoop.developer.yahoo.net/
> * Yahoo user community:
>   http://tech.groups.yahoo.com/group/Oozie-users/
>
> Initial Source
> Oozie has been under development since 2009 by a team of engineers at
> Yahoo!. It is currently hosted on GitHub under an Apache license at
> https://github.com/yahoo/oozie.
>
> External Dependencies
> The required external dependencies are all under the Apache License or
> compatible licenses. The components with non-Apache licenses are
> enumerated below:
>
> * HSQLDB License: HSQLDB
> * JDOM license: JDOM
> * BSD: Serp
> * CDDL v1: jaxb-api, ejb, JAF
>
> NOTE: With the exception of HSQLDB and JDOM, which are directly used by
> Oozie, the other listed components are transitive dependencies of other
> Apache components used by Oozie.
>
> Cryptography
> Oozie supports the Kerberos authentication mechanism to access secured
> Hadoop services.
>
> Required Resources
> Mailing Lists
> * oozie-private for private PMC discussions (with moderated
>   subscriptions)
> * oozie-dev
> * oozie-commits
> * oozie-user
>
> Subversion Directory
> https://svn.apache.org/repos/asf/incubator/oozie
>
> Issue Tracking
> JIRA Oozie (OOZIE)
>
> Other Resources
> The existing code already has unit tests, so we would like a Hudson
> instance to run them whenever a new patch is submitted.
> This can be added after project creation.
>
> Initial Committers
> * Mohammad K Islam (mislam77 at yahoo dot com)
> * Angelo K Huang (angelohuang at gmail dot com)
> * Mayank Bansal (mabansal at gmail dot com)
> * Andreas Neumann (neunand at gmail dot com)
> * Alejandro Abdelnur (tucu00 at gmail dot com)
> * Chao Wang (brookwc at gmail dot com)
>
> Affiliations
> * Mohammad K Islam (Yahoo!)
> * Angelo Huang (Yahoo!)
> * Mayank Bansal (Yahoo!)
> * Andreas Neumann (Yahoo!)
> * Alejandro Abdelnur (Cloudera)
> * Chao Wang (IBM)
>
> Sponsors
> Champion
> Alan Gates
>
> Nominated Mentors
> * Owen O'Malley (Incubator PMC member)
> * Alan Gates (Incubator PMC member)
> * Christopher Douglas (Incubator PMC member)
> * Devaraj Das (Hadoop PMC member)
>
> Sponsoring Entity
> We are requesting the Incubator to sponsor this project.

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org