From: "Franklin, Matthew B."
To: "general@incubator.apache.org"
Subject: RE: [VOTE] Accept Crunch into the Apache Incubator
Date: Thu, 24 May 2012 11:52:18 +0000
Message-ID: <2E45169E9A237B4DA78078A68962F9EF43A14E@IMCMBX01.MITRE.ORG>

+1 (binding)

>-----Original Message-----
>From: Josh Wills [mailto:jwills@cloudera.com]
>Sent: Wednesday, May 23, 2012 2:46 PM
>To: general@incubator.apache.org
>Subject: [VOTE] Accept Crunch into the Apache Incubator
>
>I would like to call a vote for accepting "Apache Crunch" for
>incubation in the Apache Incubator. The full proposal is available
>below. We ask the Incubator PMC to sponsor it, with phunt as
>Champion, and phunt, tomwhite, and acmurthy volunteering to be
>Mentors.
>
>Please cast your vote:
>
>[ ] +1, bring Crunch into Incubator
>[ ] +0, I don't care either way,
>[ ] -1, do not bring Crunch into Incubator, because...
>
>This vote will be open for 72 hours and only votes from the Incubator
>PMC are binding.
>
>http://wiki.apache.org/incubator/CrunchProposal
>
>Proposal text from the wiki:
>----------------------------------------------------------------------------------------------------------------------
>= Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
>
>== Abstract ==
>
>Crunch is a Java library for writing, testing, and running pipelines
>of !MapReduce jobs on Apache Hadoop.
>
>== Proposal ==
>
>Crunch is a Java library for writing, testing, and running pipelines
>of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
>high-level API for writing and testing complex !MapReduce jobs that
>require multiple processing stages.
>It has a simple, flexible, and
>extensible data model that makes it ideal for processing data that
>does not naturally fit into a relational structure, such as time
>series and serialized object formats like JSON and Avro. It supports
>running pipelines either as a series of !MapReduce jobs on an Apache
>Hadoop cluster or in memory on a single machine for fast testing and
>debugging.
>
>== Background ==
>
>Crunch was initially developed by Cloudera to simplify the process of
>creating sequences of dependent !MapReduce jobs, especially jobs that
>processed non-relational data like time series. Its design was based
>on a paper Google published about a Java library they developed called
>!FlumeJava that was created in order to solve a similar class of
>problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
>2.0 licensed project in October 2011. Since then Crunch has been
>formally released twice, as versions 0.1.0 (October 2011) and 0.2.0
>(February 2012), with an incremental update to version 0.2.1 (March
>2012). These releases are also distributed by Cloudera as source and
>binaries from Cloudera's Maven repository.
>
>== Rationale ==
>
>Most of the interesting analytical and data processing tasks that are
>run on an Apache Hadoop cluster require a series of !MapReduce jobs to
>be executed in sequence. Developers who are creating these pipelines
>today need to manually assign the sequence of tasks to perform in a
>dependent chain of !MapReduce jobs, even though there are a number of
>well-known patterns for fusing dependent computations together into a
>single !MapReduce stage and for performing common types of joins and
>aggregations. This results in !MapReduce pipelines that are more
>difficult to test, maintain, and extend to support new functionality.
>
>Furthermore, the type of data that is being stored and processed using
>Apache Hadoop is evolving.
>Although Hadoop was originally used for
>storing large volumes of structured text in the form of webpages and
>log files, it is now common for Hadoop to store complex, structured
>data formats such as JSON, Apache Avro, and Apache Thrift. These
>formats allow developers to work with serialized objects in
>programming languages like Java, C++, and Python, and allow for new
>types of analysis to be performed on complex data types. Hadoop has
>also been adopted by the scientific research community, who are using
>Hadoop to process time series data, structured binary files in the
>HDF5 format, and large medical and satellite images.
>
>Crunch addresses these challenges by providing a lightweight and
>extensible Java API for defining the stages of a data processing
>pipeline, which can then be run on an Apache Hadoop cluster as a
>sequence of dependent !MapReduce jobs, or in-memory on a single
>machine to facilitate fast testing and debugging. Crunch relies on a
>small set of primitive abstractions that represent immutable,
>distributed collections of objects. Developers define functions that
>are applied to those objects in order to generate new immutable,
>distributed collections of objects. Crunch also provides a library of
>common !MapReduce patterns for performing efficient joins and
>aggregation operations over these distributed collections that
>developers may integrate into their own pipelines. Crunch also
>provides native support for processing structured binary data formats
>like JSON, Apache Avro, and Apache Thrift, and is designed to be
>extensible to support working with any kind of data format that Java
>supports in its native form.
>
>== Initial Goals ==
>
>Crunch is currently in its first major release with a considerable
>number of enhancement requests, tasks, and issues recorded towards its
>future development.
>The initial goal of this project will be to
>continue to build community in the spirit of the "Apache Way", and to
>address the highly requested features and bug-fixes towards the next
>dot release.
>
>Some goals include:
> * To stand up a sustaining Apache-based community around the Crunch
>codebase.
> * Improved documentation of Java libraries and best practices.
> * Support the ability to "fuse" logically independent pipeline stages
>that aggregate the same data in different ways into a single
>!MapReduce job.
> * Performance, usability, and robustness improvements.
> * Improving diagnostic reporting and debugging for individual !MapReduce
>jobs.
> * Providing a centralized place for contributed extensions and
>domain-specific applications.
>
>= Current Status =
>
>== Meritocracy ==
>
>Crunch was initially developed by Josh Wills in September 2011 at
>Cloudera. Developers external to Cloudera provided feedback, suggested
>features and fixes and implemented extensions of Crunch. Cloudera's
>engineering team has since maintained the project with Josh Wills, Tom
>White, and Brock Noland dedicated towards its improvement.
>Contributors to Crunch include developers from multiple organizations,
>including businesses and universities.
>
>== Community ==
>
>Crunch is currently used by a number of organizations all over the
>world. Crunch has an active and growing user and developer community
>with active participation in
>[[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user]]
>and [[https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer]]
>mailing lists.
>
>Since open sourcing the project, there have been eight individuals
>from five organizations who have contributed code.
>
>== Core Developers ==
>
>The core developers for Crunch are:
> * Brock Noland: Wrote many of the test cases, user documentation, and
>contributed several bug fixes.
> * Josh Wills: Josh wrote much of the original Crunch code.
> * Gabriel Reid: Gabriel significantly improved Crunch's handling of
>Avro data and has contributed several bug fixes for the core planner.
> * Tom White: Tom added several libraries for common !MapReduce
>pipeline operations, including the sort library and a library of set
>operations.
> * Christian Tzolov: Christian has contributed several bug fixes for
>the Avro serialization module and the unit testing framework.
> * Robert Chu: Robert did the left/right/outer join implementations
>for Crunch and fixed several bugs in the runtime configuration logic.
>
>Several of the core developers of Crunch have contributed towards
>Hadoop or related Apache projects and are familiar with Apache
>principles and philosophy for community driven software development.
>
>== Alignment ==
>
>Crunch complements several current Apache projects. It complements
>Hadoop !MapReduce by providing a higher-level API for developing
>complex data processing pipelines that require a sequence of
>!MapReduce jobs to perform. Crunch also supports Apache HBase in order
>to simplify the process of writing !MapReduce jobs that execute over
>HBase tables. Crunch makes extensive use of the Apache Avro data
>format as an internal data representation process that makes
>!MapReduce jobs execute quickly and efficiently.
>
>= Known Risks =
>
>== Orphaned Products ==
>
>Crunch is already deployed in production at multiple companies and
>they are actively participating in creating new features. Crunch is
>getting traction with developers and thus the risks of it being
>orphaned are minimal.
>
>== Inexperience with Open Source ==
>
>All code developed for Crunch has been open sourced by Cloudera under
>Apache 2.0 license. All committers to Crunch are intimately familiar
>with the Apache model for open-source development and are experienced
>with working with new contributors.
>
>== Homogeneous Developers ==
>
>The initial set of committers is from a reduced set of organizations.
>However, we expect that once approved for incubation, the project will
>attract new contributors from diverse organizations and will thus grow
>organically. The submission of patches from developers from several
>different organizations is a strong indication that Crunch will be
>widely adopted.
>
>== Reliance on Salaried Developers ==
>
>It is expected that Crunch will be developed on salaried and volunteer
>time, although all of the initial developers will work on it mainly on
>salaried time.
>
>== Relationships with Other Apache Products ==
>
>Crunch depends upon other Apache Projects: Apache Hadoop, Apache
>HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
>Commons components. Its build depends upon Apache Maven.
>
>Crunch's functionality has some indirect or direct overlap with the
>functionality of Apache Pig and Apache Hive but has several
>significant differences in terms of their user community and the types
>of data they are designed to work with. Both Hive and Pig are
>high-level languages that are designed to allow non-programmers to
>quickly create and run !MapReduce jobs. Crunch is a Java library whose
>primary community is Java developers who are creating scalable data
>pipelines and !MapReduce-based applications. Additionally, Hive and
>Pig both employ a relational, tuple-oriented data model on top of
>HDFS, which introduces overhead and limits expressive power for
>developers who are working with serialized objects and non-relational
>data types. Crunch uses a lower-level data model that gives developers
>the freedom to work with data in a format that is optimized for the
>problem they are trying to solve.
>
>== An Excessive Fascination with the Apache Brand ==
>
>We would like Crunch to become an Apache project to further foster a
>healthy community of contributors and consumers around the project.
>Since Crunch directly interacts with many Apache Hadoop-related
>projects and solves an important problem of many Hadoop users,
>residing in the Apache Software Foundation will increase interaction
>with the larger community.
>
>= Documentation =
>
> * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
> * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
> * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
>
>= Initial Source =
>
> * https://github.com/cloudera/crunch/tree/
>
>== Source and Intellectual Property Submission Plan ==
>
> * The initial source is already licensed under the Apache License,
>Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
>
>== External Dependencies ==
>
>The required external dependencies are all Apache License or
>compatible licenses. The following components with non-Apache licenses
>are enumerated:
>
> * com.google.protobuf : New BSD
> * org.hamcrest: New BSD
> * org.slf4j: MIT-like License
>
>Non-Apache build tools that are used by Crunch are as follows:
>
> * Cobertura: GNU GPLv2
>
>Note that Cobertura is optional and is only used for calculating unit
>test coverage.
>
>== Cryptography ==
>
>Crunch uses standard APIs and tools for SSH and SSL communication
>where necessary.
>
>= Required Resources =
>
>== Mailing lists ==
>
> * crunch-private (with moderated subscriptions)
> * crunch-dev
> * crunch-commits
> * crunch-user
>
>== Github Repositories ==
>
>http://github.com/apache/crunch
>git://git.apache.org/crunch.git
>
>== Issue Tracking ==
>
>JIRA Crunch (CRUNCH)
>
>== Other Resources ==
>
>The existing code already has unit and integration tests so we would
>like a Jenkins instance to run them whenever a new patch is submitted.
>This can be added after project creation.
>
>= Initial Committers =
>
> * Brock Noland (brock at cloudera dot com)
> * Josh Wills (jwills at cloudera dot com)
> * Gabriel Reid (gabriel dot reid at gmail dot com)
> * Tom White (tom at cloudera dot com)
> * Christian Tzolov (christian dot tzolov at gmail dot com)
> * Robert Chu (robert at wibidata dot com)
> * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>
>= Affiliations =
>
> * Brock Noland, Cloudera
> * Josh Wills, Cloudera
> * Gabriel Reid, !TomTom
> * Tom White, Cloudera
> * Christian Tzolov, !TomTom
> * Robert Chu, !WibiData
> * Vinod Kumar Vavilapalli, Hortonworks
>
>= Sponsors =
>
>== Champion ==
>
> * Patrick Hunt
>
>== Nominated Mentors ==
>
> * Tom White
> * Patrick Hunt
> * Arun Murthy
>
>== Sponsoring Entity ==
>
> * Apache Incubator PMC
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>For additional commands, e-mail: general-help@incubator.apache.org