From: Arun C Murthy
Mime-Version: 1.0 (Apple Message framework v1084)
Content-Type: multipart/alternative; boundary=Apple-Mail-161-941196019
Subject: Re: [VOTE] Accept Crunch into the Apache Incubator
Date: Thu, 24 May 2012 09:49:25 -0700
To: general@incubator.apache.org

--Apple-Mail-161-941196019
Content-Type: text/plain; charset=us-ascii

+1 (binding)

On May 23, 2012, at 11:45 AM, Josh Wills wrote:

> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below. We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
>
> Please cast your vote:
>
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
>
> http://wiki.apache.org/incubator/CrunchProposal
>
> Proposal text from the wiki:
> ----------------------------------------------------------------------------------------------------------------------
> = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
>
> == Abstract ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
>
> == Proposal ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages. It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
>
> == Background ==
>
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series. Its design was based
> on a paper Google published about a Java library they developed called
> !FlumeJava that was created in order to solve a similar class of
> problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. During this time Crunch has been
> formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012). These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
>
> == Rationale ==
>
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence.
> Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
>
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
>
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects.
> Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured binary data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
>
> == Initial Goals ==
>
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development. The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
>
> Some goals include:
> * To stand up a sustaining Apache-based community around the Crunch codebase.
> * Improved documentation of Java libraries and best practices.
> * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
> * Performance, usability, and robustness improvements.
> * Improving diagnostic reporting and debugging for individual !MapReduce jobs.
> * Providing a centralized place for contributed extensions and
> domain-specific applications.
>
> = Current Status =
>
> == Meritocracy ==
>
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
>
> == Community ==
>
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> [[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user]]
> and [[https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer]]
> mailing lists.
>
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
>
> == Core Developers ==
>
> The core developers for Crunch are:
> * Brock Noland: Wrote many of the test cases, user documentation, and
> contributed several bug fixes.
> * Josh Wills: Josh wrote much of the original Crunch code.
> * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
> * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
> * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
> * Robert Chu: Robert did the left/right/outer join implementations
> for Crunch and fixed several bugs in the runtime configuration logic.
>
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community driven software development.
>
> == Alignment ==
>
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables.
> Crunch makes extensive use of the Apache Avro data
> format as an internal data representation process that makes
> !MapReduce jobs execute quickly and efficiently.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Crunch has been open sourced by Cloudera under
> Apache 2.0 license. All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
>
> == Relationships with Other Apache Products ==
>
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
>
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive but has several
> significant differences in terms of their user community and the types
> of data they are designed to work with.
> Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
>
> == An Excessive Fascination with the Apache Brand ==
>
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
>
> = Documentation =
>
> * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
> * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
> * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
>
> = Initial Source =
>
> * https://github.com/cloudera/crunch/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
> * The initial source is already licensed under the Apache License,
> Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
>
> == External Dependencies ==
>
> The required external dependencies are all Apache License or
> compatible licenses.
> Following components with non-Apache licenses are
> enumerated:
>
> * com.google.protobuf : New BSD
> * org.hamcrest: New BSD
> * org.slf4j: MIT-like License
>
> Non-Apache build tools that are used by Crunch are as follows:
>
> * Cobertura: GNU GPLv2
>
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
>
> == Cryptography ==
>
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
>
> = Required Resources =
>
> == Mailing lists ==
>
> * crunch-private (with moderated subscriptions)
> * crunch-dev
> * crunch-commits
> * crunch-user
>
> == Github Repositories ==
>
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git
>
> == Issue Tracking ==
>
> JIRA Crunch (CRUNCH)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
>
> = Initial Committers =
>
> * Brock Noland (brock at cloudera dot com)
> * Josh Wills (jwills at cloudera dot com)
> * Gabriel Reid (gabriel dot reid at gmail dot com)
> * Tom White (tom at cloudera dot com)
> * Christian Tzolov (christian dot tzolov at gmail dot com)
> * Robert Chu (robert at wibidata dot com)
> * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>
> = Affiliations =
>
> * Brock Noland, Cloudera
> * Josh Wills, Cloudera
> * Gabriel Reid, !TomTom
> * Tom White, Cloudera
> * Christian Tzolov, !TomTom
> * Robert Chu, !WibiData
> * Vinod Kumar Vavilapalli, Hortonworks
>
> = Sponsors =
>
> == Champion ==
>
> * Patrick Hunt
>
> == Nominated Mentors ==
>
> * Tom White
> * Patrick Hunt
> * Arun Murthy
>
> == Sponsoring Entity ==
>
> * Apache Incubator PMC
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/

--Apple-Mail-161-941196019--