From: Brock Noland
Date: Wed, 23 May 2012 14:16:46 -0700
Subject: Re: [VOTE] Accept Crunch into the Apache Incubator
Reply-To: general@incubator.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=e89a8fb1ff04475cfd04c0baa8aa

--e89a8fb1ff04475cfd04c0baa8aa
Content-Type: text/plain; charset=ISO-8859-1

[X] +1, bring Crunch into Incubator (non-binding)

On Wed, May 23, 2012 at 11:45 AM, Josh Wills wrote:
> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below. We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
>
> Please cast your vote:
>
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
>
> http://wiki.apache.org/incubator/CrunchProposal
>
> Proposal text from the wiki:
>
> ----------------------------------------------------------------------------------------------------------------------
> = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
>
> == Abstract ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
>
> == Proposal ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
> Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages. It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
>
> == Background ==
>
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series. Its design was based
> on a paper Google published about a Java library they developed called
> !FlumeJava that was created in order to solve a similar class of
> problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. During this time Crunch has been
> formally released twice, as versions 0.1.0 (October 2011) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012). These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
>
> == Rationale ==
>
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence. Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
>
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
>
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects. Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured binary data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
>
> == Initial Goals ==
>
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development.
> The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
>
> Some goals include:
> * To stand up a sustaining Apache-based community around the Crunch
> codebase.
> * Improved documentation of Java libraries and best practices.
> * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
> * Performance, usability, and robustness improvements.
> * Improving diagnostic reporting and debugging for individual !MapReduce
> jobs.
> * Providing a centralized place for contributed extensions and
> domain-specific applications.
>
> = Current Status =
>
> == Meritocracy ==
>
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
>
> == Community ==
>
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> [[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user
> ]]
> and [[
> https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer
> ]]
> mailing lists.
>
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
>
> == Core Developers ==
>
> The core developers for Crunch are:
> * Brock Noland: Wrote many of the test cases, user documentation, and
> contributed several bug fixes.
> * Josh Wills: Josh wrote much of the original Crunch code.
> * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
> * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
> * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
> * Robert Chu: Robert did the left/right/outer join implementations
> for Crunch and fixed several bugs in the runtime configuration logic.
>
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community driven software development.
>
> == Alignment ==
>
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables. Crunch makes extensive use of the Apache Avro data
> format as an internal data representation process that makes
> !MapReduce jobs execute quickly and efficiently.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Crunch has been open sourced by Cloudera under
> the Apache 2.0 license. All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
>
> == Relationships with Other Apache Products ==
>
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
>
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive but has several
> significant differences in terms of their user community and the types
> of data they are designed to work with. Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
>
> == An Excessive Fascination with the Apache Brand ==
>
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
>
> = Documentation =
>
> * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
> * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
> * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
>
> = Initial Source =
>
> * https://github.com/cloudera/crunch/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
> * The initial source is already licensed under the Apache License,
> Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
>
> == External Dependencies ==
>
> The required external dependencies are all Apache License or
> compatible licenses. The following components with non-Apache licenses are
> enumerated:
>
> * com.google.protobuf : New BSD
> * org.hamcrest: New BSD
> * org.slf4j: MIT-like License
>
> Non-Apache build tools that are used by Crunch are as follows:
>
> * Cobertura: GNU GPLv2
>
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
>
> == Cryptography ==
>
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
>
> = Required Resources =
>
> == Mailing lists ==
>
> * crunch-private (with moderated subscriptions)
> * crunch-dev
> * crunch-commits
> * crunch-user
>
> == GitHub Repositories ==
>
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git
>
> == Issue Tracking ==
>
> JIRA Crunch (CRUNCH)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
>
> = Initial Committers =
>
> * Brock Noland (brock at cloudera dot com)
> * Josh Wills (jwills at cloudera dot com)
> * Gabriel Reid (gabriel dot reid at gmail dot com)
> * Tom White (tom at cloudera dot com)
> * Christian Tzolov (christian dot tzolov at gmail dot com)
> * Robert Chu (robert at wibidata dot com)
> * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>
> = Affiliations =
>
> * Brock Noland, Cloudera
> * Josh Wills, Cloudera
> * Gabriel Reid, !TomTom
> * Tom White, Cloudera
> * Christian Tzolov, !TomTom
> * Robert Chu, !WibiData
> * Vinod Kumar Vavilapalli, Hortonworks
>
> = Sponsors =
>
> == Champion ==
>
> * Patrick Hunt
>
> == Nominated Mentors ==
>
> * Tom White
> * Patrick Hunt
> * Arun Murthy
>
> == Sponsoring Entity ==
>
> * Apache Incubator PMC
>

-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

--e89a8fb1ff04475cfd04c0baa8aa--