Mailing-List: contact general-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: general@incubator.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
From: Arun C Murthy <acm@hortonworks.com>
Mime-Version: 1.0 (Apple Message framework v1084)
Content-Type: multipart/alternative; boundary=Apple-Mail-161-941196019
Subject: Re: [VOTE] Accept Crunch into the Apache Incubator
Date: Thu, 24 May 2012 09:49:25 -0700
In-Reply-To: 
 <CAH29n6PzL_J3_2a-Ayj7W3yVweniBrFhULXk4NPF3hfPXV6hSg@mail.gmail.com>
To: general@incubator.apache.org
References: 
 <CAH29n6PzL_J3_2a-Ayj7W3yVweniBrFhULXk4NPF3hfPXV6hSg@mail.gmail.com>
Message-Id: <E5FB2186-C100-4374-885A-11A5643A1BC1@hortonworks.com>

--Apple-Mail-161-941196019
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=us-ascii

+1 (binding)

On May 23, 2012, at 11:45 AM, Josh Wills wrote:

> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below.  We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
>=20
> Please cast your vote:
>=20
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
>=20
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
>=20
> http://wiki.apache.org/incubator/CrunchProposal
>=20
> Proposal text from the wiki:
> =
--------------------------------------------------------------------------=
--------------------------------------------
> =3D Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =3D
>=20
> =3D=3D Abstract =3D=3D
>=20
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
>=20
> =3D=3D Proposal =3D=3D
>=20
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages.  It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
>=20
> =3D=3D Background =3D=3D
>=20
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series. Its design was based
> on a paper Google published about a Java library they developed called
> !FlumeJava that was created in order to solve a similar class of
> problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. During this time Crunch has been
> formally released twice, as versions 0.1.0 (October 2010) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012) .  These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
>=20
> =3D=3D Rationale =3D=3D
>=20
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence. Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
>=20
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
>=20
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects. Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured binary data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
>=20
> =3D=3D Initial Goals =3D=3D
>=20
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development. The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
>=20
> Some goals include:
> * To stand up a sustaining Apache-based community around the Crunch =
codebase.
> * Improved documentation of Java libraries and best practices.
> * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
> * Performance, usability, and robustness improvements.
> * Improving diagnostic reporting and debugging for individual =
!MapReduce jobs.
> * Providing a centralized place for contributed extensions and
> domain-specific applications.
>=20
> =3D Current Status =3D
>=20
> =3D=3D Meritocracy =3D=3D
>=20
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
>=20
> =3D=3D Community =3D=3D
>=20
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> =
[[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user]=
]
> and =
[[https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|develop=
er]]
> mailing lists.
>=20
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
>=20
> =3D=3D Core Developers =3D=3D
>=20
> The core developers for Crunch are:
> * Brock Noland: Wrote many of the test cases, user documentation, and
> contributed several bug fixes.
> * Josh Wills: Josh wrote much of the original Crunch code.
> * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
> * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
> * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
> * Robert Chu: Robert did the left/right/outer join implementations
> for Crunch and fixed several bugs in the runtime configuration logic.
>=20
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community driven software development.
>=20
> =3D=3D Alignment =3D=3D
>=20
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables. Crunch makes extensive use of the Apache Avro data
> format as an internal data representation process that makes
> !MapReduce jobs execute quickly and efficiently.
>=20
> =3D Known Risks =3D
>=20
> =3D=3D Orphaned Products =3D=3D
>=20
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
>=20
> =3D=3D Inexperience with Open Source =3D=3D
>=20
> All code developed for Crunch has been open sourced by Cloudera under
> Apache 2.0 license.  All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
>=20
> =3D=3D Homogeneous Developers =3D=3D
>=20
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
>=20
> =3D=3D Reliance on Salaried Developers =3D=3D
>=20
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
>=20
> =3D=3D Relationships with Other Apache Products =3D=3D
>=20
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
>=20
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive but has several
> significant differences in terms of their user community and the types
> of data they are designed to work with.  Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
>=20
> =3D=3D An Excessive Fascination with the Apache Brand =3D=3D
>=20
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
>=20
> =3D Documentation =3D
>=20
> * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
> * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
> * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
>=20
> =3D Initial Source =3D
>=20
> * https://github.com/cloudera/crunch/tree/
>=20
> =3D=3D Source and Intellectual Property Submission Plan =3D=3D
>=20
> * The initial source is already licensed under the Apache License,
> Version 2.0. =
https://github.com/cloudera/crunch/blob/master/LICENSE.txt
>=20
> =3D=3D External Dependencies =3D=3D
>=20
> The required external dependencies are all Apache License or
> compatible licenses. Following components with non-Apache licenses are
> enumerated:
>=20
> * com.google.protobuf : New BSD
> * org.hamcrest: New BSD
> * org.slf4j: MIT-like License
>=20
> Non-Apache build tools that are used by Crunch are as follows:
>=20
> * Cobertura: GNU GPLv2
>=20
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
>=20
> =3D=3D Cryptography =3D=3D
>=20
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
>=20
> =3D Required  Resources =3D
>=20
> =3D=3D Mailing lists =3D=3D
>=20
> * crunch-private (with moderated subscriptions)
> * crunch-dev
> * crunch-commits
> * crunch-user
>=20
> =3D=3D Github Repositories =3D=3D
>=20
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git
>=20
> =3D=3D Issue Tracking =3D=3D
>=20
> JIRA Crunch (CRUNCH)
>=20
> =3D=3D Other Resources =3D=3D
>=20
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
>=20
> =3D Initial Committers =3D
>=20
> * Brock Noland (brock at cloudera dot com)
> * Josh Wills (jwills at cloudera dot com)
> * Gabriel Reid (gabriel dot reid at gmail dot com)
> * Tom White (tom at cloudera dot com)
> * Christian Tzolov (christian dot tzolov at gmail dot com)
> * Robert Chu (robert at wibidata dot com)
> * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>=20
> =3D Affiliations =3D
>=20
> * Brock Noland, Cloudera
> * Josh Wills, Cloudera
> * Gabriel Reid, !TomTom
> * Tom White, Cloudera
> * Christian Tzolov, !TomTom
> * Robert Chu, !WibiData
> * Vinod Kumar Vavilapalli, Hortonworks
>=20
> =3D Sponsors =3D
>=20
> =3D=3D Champion =3D=3D
>=20
> * Patrick Hunt
>=20
> =3D=3D Nominated Mentors =3D=3D
>=20
> * Tom White
> * Patrick Hunt
> * Arun Murthy
>=20
> =3D=3D Sponsoring Entity =3D=3D
>=20
> * Apache Incubator PMC
>=20
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>=20

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/


--Apple-Mail-161-941196019--