From: Timothy Chen
To: "general@incubator.apache.org"
Subject: Re: [VOTE] Accept Parquet into the incubator
Date: Sun, 18 May 2014 18:28:09 -0700
Message-Id: <3A14C077-207C-4D17-9872-F3CFF5DDC654@gmail.com>

+1 non-binding.

Tim

> On May 18, 2014, at 6:14 PM, Jake Farrell wrote:
>
> +1 (binding)
>
> -Jake
>
>
>
> On Sun, May 18, 2014 at 5:15 PM, Chris Aniszczyk wrote:
>
>> Based on the results of the discussion thread:
>>
>> http://mail-archives.apache.org/mod_mbox/incubator-general/201405.mbox/%3CCAJg1wMRGhLu4P7LeVQB%2B5K0C-fr-pw2448uj%3D6-3zHag4F1EbA%40mail.gmail.com%3E
>>
>> I would like to call a vote on accepting Parquet into the incubator.
>> https://wiki.apache.org/incubator/ParquetProposal
>>
>> [ ] +1 Accept Parquet into the Incubator
>> [ ] +0 Indifferent to the acceptance of Parquet
>> [ ] -1 Do not accept Parquet because ...
>>
>> The vote will be open until Thursday May 22nd 18:00 UTC.
>>
>> = Parquet Proposal =
>>
>> == Abstract ==
>> Parquet is a columnar storage format for Hadoop.
>>
>> == Proposal ==
>>
>> We created Parquet to make the advantages of compressed, efficient columnar
>> data representation available to any project in the Hadoop ecosystem,
>> regardless of the choice of data processing framework, data model, or
>> programming language.
>>
>> == Background ==
>>
>> Parquet is built from the ground up with complex nested data structures in
>> mind, and uses the repetition/definition level approach to encoding such
>> data structures, as popularized by Google Dremel (
>> https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We believe
>> this approach is superior to simple flattening of nested name spaces.
>>
>> Parquet is built to support very efficient compression and encoding
>> schemes. Parquet allows compression schemes to be specified on a per-column
>> level, and is future-proofed to allow adding more encodings as they are
>> invented and implemented. We separate the concepts of encoding and
>> compression, allowing Parquet consumers to implement operators that work
>> directly on encoded data without paying a decompression and decoding penalty
>> when possible.
>>
>> == Rationale ==
>>
>> Parquet is built to be used by anyone. We believe that an efficient,
>> well-implemented columnar storage substrate should be useful to all
>> frameworks without the cost of extensive and difficult-to-set-up
>> dependencies.
>>
>> Furthermore, the rapid growth of the Parquet community is empowered by open
>> source. We believe the Apache foundation is a great fit as the long-term
>> home for Parquet, as it provides an established process for
>> community-driven development and decision making by consensus. This is
>> exactly the model we want for future Parquet development.
>>
>> == Initial Goals ==
>>
>> * Move the existing codebase to Apache
>> * Integrate with the Apache development process
>> * Ensure all dependencies are compliant with Apache License version 2.0
>> * Incremental development and releases per Apache guidelines
>>
>> == Current Status ==
>>
>> Parquet has undergone 2 major releases:
>> https://github.com/Parquet/parquet-format/releases of the core format and
>> 22 releases: https://github.com/Parquet/parquet-mr/releases of the
>> supporting set of Java libraries.
>>
>> The Parquet source is currently hosted at GitHub, which will seed the
>> Apache git repository.
>>
>> === Meritocracy ===
>>
>> We plan to invest in supporting a meritocracy. We will discuss the
>> requirements in an open forum. Several companies have already expressed
>> interest in this project, and we intend to invite additional developers to
>> participate. We will encourage and monitor community participation so that
>> privileges can be extended to those that contribute.
>>
>> === Community ===
>>
>> There is a large need for an advanced columnar storage format for Hadoop.
>> Parquet is being used in production by many organizations (see
>> https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md)
>>
>> * Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
>> * Criteo: https://twitter.com/julsimon/statuses/312114074911666177
>> * Salesforce: https://twitter.com/TwitterOSS/statuses/392734610116726784
>> * Stripe: https://twitter.com/avibryant/statuses/391339949250715648
>> * Twitter: https://twitter.com/J_/statuses/315844725611581441
>>
>> By bringing Parquet into Apache, we believe that the community will grow
>> even bigger.
>>
>> === Core Developers ===
>>
>> Parquet was initially developed as a collaboration between Twitter,
>> Cloudera and Criteo.
>>
>> See
>>
>> https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
>>
>> === Alignment ===
>>
>> We believe that having Parquet at Apache will help further the growth of
>> the big-data community, as it will encourage cooperation within the greater
>> ecosystem of projects spawned by Apache Hadoop. The alignment is also
>> beneficial to other Apache communities (such as Hadoop, Hive, Avro).
>>
>> == Known Risks ==
>>
>> === Orphaned Products ===
>>
>> The risk of the Parquet project being abandoned is minimal. There are many
>> organizations using Parquet in production, including Twitter, Cloudera,
>> Stripe, and Salesforce (
>> http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).
>>
>> === Inexperience with Open Source ===
>>
>> Parquet has existed as a healthy open source project for one year. During
>> that time, we have curated an open-source community successfully, attracting
>> over 40 contributors (see
>> https://github.com/Parquet/parquet-mr/graphs/contributors) from a diverse
>> group of companies.
>> Several of the core contributors to the project are deeply familiar with
>> OSS and Apache specifically: Julien Le Dem was until recently the PMC Chair
>> for Apache Pig, and Dmitriy Ryaboy, Aniket Mokashi, and Jonathan Coveney
>> are also Apache Pig committers with contributions to several other Apache
>> projects. Todd Lipcon and Tom White are committers to Apache Hadoop and
>> multiple other related projects. Brock Noland is a Hive committer.
>>
>> === Homogenous Developers ===
>>
>> The initial committers come from a number of companies and countries.
>> Parquet has an active community of developers, and we are committed to
>> recruiting additional committers based on their contributions to the
>> project.
>> The Java library component alone has contributions from 31
>> individual GitHub accounts, 14 of which contributed over 1000 lines of
>> code.
>>
>> === Reliance on Salaried Developers ===
>>
>> It is expected that Parquet development will occur on both salaried time
>> and on volunteer time, after hours. The majority of initial committers are
>> paid by their employers to contribute to this project. However, they are
>> all passionate about the project, and we are confident that the project
>> will continue even if no salaried developers contribute to the project. As
>> evidence of this statement, we present the GitHub punchcard (see
>> https://github.com/Parquet/parquet-mr/graphs/punch-card) showing that a
>> lot of activity happens on weekends. We are committed to recruiting
>> additional committers including non-salaried developers.
>>
>> === Relationships with Other Apache Products ===
>>
>> As mentioned in the Alignment section, Parquet is closely related to
>> Hadoop. It provides an API that allows it to be easily integrated with
>> many other Apache projects: Pig, Hive, Avro, Thrift, Spark, Drill, Crunch,
>> Tajo. Some of the features it provides are similar to the ORC file format,
>> which is part of the Hive project. However, Parquet focuses on being
>> framework agnostic and language independent and has been really successful
>> to that end. On top of the Apache projects mentioned above, Parquet is also
>> integrated with other open source projects, including Protocol Buffers,
>> Cloudera Impala and Scrooge. We look forward to continuing to collaborate
>> with those communities, as well as other Apache communities.
>>
>> === An Excessive Fascination with the Apache Brand ===
>>
>> Parquet is an already healthy and well known open source project. This
>> proposal is not for the purpose of generating publicity.
>> Rather, the primary benefits to joining Apache are those outlined in the
>> Rationale section.
>>
>> == Documentation ==
>>
>> Documentation is currently located as README markdown files:
>>
>> * https://github.com/Parquet/parquet-format
>> * https://github.com/Parquet/parquet-mr
>>
>> == Source and Intellectual Property Submission Plan ==
>>
>> The Parquet codebase is currently hosted on GitHub:
>> https://github.com/Parquet.
>>
>> These are the codebases that we would migrate to the Apache foundation.
>>
>> == External Dependencies ==
>>
>>
>> * JUnit: EPL
>> * Apache Commons: ALv2
>> * Apache Thrift: ALv2
>> * Apache Maven: ALv2
>> * Apache Avro: ALv2
>> * Apache Hadoop: ALv2
>> * Google Guava: ALv2
>> * Google Protobuf: New BSD License
>>
>> == Cryptography ==
>>
>> We do not expect Parquet to be a controlled export item due to the use of
>> encryption.
>>
>> == Required Resources ==
>>
>> === Mailing lists ===
>>
>> * private@parquet.incubator.apache.org
>> * commits@parquet.incubator.apache.org
>> * dev@parquet.incubator.apache.org
>>
>> == Subversion Directory ==
>>
>> Git is the preferred source control system:
>>
>> * git://git.apache.org/parquet-format
>> * git://git.apache.org/parquet-mr
>>
>> == Issue Tracking ==
>>
>> We'd like to keep using the Git review and issue tracking tools.
>> Controlling Pull requests closing through git commit messages in
>> git.apache.org
>>
>> == Initial Committers ==
>>
>> * Aniket Mokashi
>> * Brock Noland
>> * Chris Aniszczyk
>> * Dmitriy Ryaboy
>> * Jake Farrell
>> * Jonathan Coveney
>> * Julien Le Dem
>> * Lukas Nalezenec
>> * Marcel Kornacker
>> * Mickael Lacour
>> * Nong Li
>> * Remy Pecqueur
>> * Ryan Blue
>> * Tianshuo Deng
>> * Tom White
>> * Wesley Peck
>>
>> == Affiliations ==
>>
>> * Aniket Mokashi - Twitter
>> * Brock Noland - Cloudera
>> * Chris Aniszczyk - Twitter
>> * Dmitriy Ryaboy - Twitter
>> * Jake Farrell
>> * Jonathan Coveney - Twitter
>> * Julien Le Dem - Twitter
>> * Lukas Nalezenec
>> * Marcel Kornacker - Cloudera
>> * Mickael Lacour - Criteo
>> * Nong Li - Cloudera
>> * Remy Pecqueur - Criteo
>> * Ryan Blue - Cloudera
>> * Tianshuo Deng - Twitter
>> * Tom White - Cloudera
>> * Wesley Peck - ARRIS, Inc.
>>
>> == Sponsors ==
>>
>> === Champion ===
>>
>> * Todd Lipcon
>>
>> === Nominated Mentors ===
>>
>> * Tom White
>> * Chris Mattmann
>> * Jake Farrell
>> * Roman Shaposhnik
>>
>> === Sponsoring Entity ===
>>
>> The Apache Incubator
>>
>> --
>> Cheers,
>>
>> Chris Aniszczyk
>> http://aniszczyk.org
>> +1 512 961 6719
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org