incubator-general mailing list archives

From "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: [VOTE] Accept Parquet into the incubator
Date Mon, 19 May 2014 00:33:46 GMT
+1 from me (binding)!

Cheers,
Chris


-----Original Message-----
From: Chris Aniszczyk <caniszczyk@gmail.com>
Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
Date: Sunday, May 18, 2014 2:15 PM
To: "general@incubator.apache.org" <general@incubator.apache.org>
Subject: [VOTE] Accept Parquet into the incubator

>Based on the results of the discussion thread:
>http://mail-archives.apache.org/mod_mbox/incubator-general/201405.mbox/%3CCAJg1wMRGhLu4P7LeVQB%2B5K0C-fr-pw2448uj%3D6-3zHag4F1EbA%40mail.gmail.com%3E
>
>I would like to call a vote on accepting Parquet into the incubator.
>https://wiki.apache.org/incubator/ParquetProposal
>
>[ ] +1 Accept Parquet into the Incubator
>[ ] +0 Indifferent to the acceptance of Parquet
>[ ] -1 Do not accept Parquet because ...
>
>The vote will be open until Thursday May 22nd 18:00 UTC.
>
>= Parquet Proposal =
>
>== Abstract ==
>Parquet is a columnar storage format for Hadoop.
>
>== Proposal ==
>
>We created Parquet to make the advantages of compressed, efficient columnar
>data representation available to any project in the Hadoop ecosystem,
>regardless of the choice of data processing framework, data model, or
>programming language.
>
>== Background ==
>
>Parquet is built from the ground up with complex nested data structures in
>mind, and uses the repetition/definition level approach to encoding such
>data structures, as popularized by Google Dremel (
>https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We believe
>this approach is superior to simple flattening of nested name spaces.
>
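To make the repetition/definition level idea concrete, here is a minimal sketch in Python. It is not Parquet's implementation: the schema (a single repeated string field, called `tags` here) and the helper names `shred` and `assemble` are assumptions chosen for illustration. With only one repeated field, both levels have a maximum of 1: repetition level 0 starts a new record, and definition level 0 marks a record whose list is empty.

```python
from typing import List, Optional, Tuple

# (value, repetition level, definition level); all names here are hypothetical.
Triple = Tuple[Optional[str], int, int]


def shred(records: List[List[str]]) -> List[Triple]:
    """Flatten records with one repeated field into a single column."""
    column: List[Triple] = []
    for tags in records:
        if not tags:
            # Empty list: store a placeholder with definition level 0.
            column.append((None, 0, 0))
            continue
        for i, tag in enumerate(tags):
            # The first value of a record gets repetition level 0,
            # later values in the same list get repetition level 1.
            column.append((tag, 0 if i == 0 else 1, 1))
    return column


def assemble(column: List[Triple]) -> List[List[str]]:
    """Rebuild the original records from the column triples."""
    records: List[List[str]] = []
    for value, repetition, definition in column:
        if repetition == 0:
            records.append([])  # repetition level 0 starts a new record
        if definition == 1:
            records[-1].append(value)  # the value is actually present
    return records
```

Because the levels travel with the values, the column can be read back into records without consulting any other column, which is the property the paragraph above contrasts with simple flattening.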
>Parquet is built to support very efficient compression and encoding
>schemes. Parquet allows compression schemes to be specified on a per-column
>level, and is future-proofed to allow adding more encodings as they are
>invented and implemented. We separate the concepts of encoding and
>compression, allowing Parquet consumers to implement operators that work
>directly on encoded data without paying a decompression and decoding
>penalty when possible.
>
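The separation of encoding from compression can be sketched as two independent layers. The example below is an illustration under stated assumptions, not Parquet's code: dictionary encoding stands in for "an encoding", zlib stands in for "a compression scheme", and the function names are hypothetical. The point is that `count_equal` answers a query by comparing small integer ids, without ever materializing the original strings.

```python
import zlib
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Encoding layer: replace each value with its id in a dictionary."""
    dictionary: List[str] = []
    positions: Dict[str, int] = {}
    ids: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        ids.append(positions[value])
    return dictionary, ids


def count_equal(dictionary: List[str], ids: List[int], target: str) -> int:
    """Operator that works on encoded data: compare ids, not strings."""
    if target not in dictionary:
        return 0  # the value never occurs, so no id can match
    target_id = dictionary.index(target)
    return sum(1 for i in ids if i == target_id)


def compress_ids(ids: List[int]) -> bytes:
    """Compression layer, applied on top of the encoding.

    Assumes fewer than 256 distinct values so each id fits in one byte.
    """
    return zlib.compress(bytes(ids))


def decompress_ids(blob: bytes) -> List[int]:
    """Undo the compression layer only; the ids stay dictionary-encoded."""
    return list(zlib.decompress(blob))
```

A reader would call `decompress_ids` once per column chunk and then run `count_equal` directly on the ids, so the compression scheme can be swapped per column without touching the operator.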
>== Rationale ==
>
>Parquet is built to be used by anyone. We believe that an efficient,
>well-implemented columnar storage substrate should be useful to all
>frameworks without the cost of extensive, difficult-to-set-up
>dependencies.
>
>Furthermore, the rapid growth of the Parquet community is empowered by open
>source. We believe the Apache Software Foundation is a great fit as the
>long-term home for Parquet, as it provides an established process for
>community-driven development and decision making by consensus. This is
>exactly the model we want for future Parquet development.
>
>== Initial Goals ==
>
> * Move the existing codebase to Apache
> * Integrate with the Apache development process
> * Ensure all dependencies are compliant with Apache License version 2.0
> * Incremental development and releases per Apache guidelines
>
>== Current Status ==
>
>Parquet has had 2 major releases of the core format
>(https://github.com/Parquet/parquet-format/releases) and 22 releases of the
>supporting set of Java libraries
>(https://github.com/Parquet/parquet-mr/releases).
>
>The Parquet source is currently hosted at GitHub, which will seed the
>Apache git repository.
>
>=== Meritocracy ===
>
>We plan to invest in supporting a meritocracy. We will discuss the
>requirements in an open forum. Several companies have already expressed
>interest in this project, and we intend to invite additional developers to
>participate. We will encourage and monitor community participation so that
>privileges can be extended to those that contribute.
>
>=== Community ===
>
>There is a strong need for an advanced columnar storage format for Hadoop.
>Parquet is being used in production by many organizations (see
>https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md), including:
>
> * Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
> * Criteo: https://twitter.com/julsimon/statuses/312114074911666177
> * Salesforce: https://twitter.com/TwitterOSS/statuses/392734610116726784
> * Stripe: https://twitter.com/avibryant/statuses/391339949250715648
> * Twitter: https://twitter.com/J_/statuses/315844725611581441
>
>By bringing Parquet into Apache, we believe that the community will grow
>even bigger.
>
>=== Core Developers ===
>
>Parquet was initially developed as a collaboration between Twitter,
>Cloudera and Criteo.
>
>See
>https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
>
>=== Alignment ===
>
>We believe that having Parquet at Apache will help further the growth of
>the big-data community, as it will encourage cooperation within the greater
>ecosystem of projects spawned by Apache Hadoop. The alignment is also
>beneficial to other Apache communities (such as Hadoop, Hive, and Avro).
>
>== Known Risks ==
>
>=== Orphaned Products ===
>
>The risk of the Parquet project being abandoned is minimal. There are many
>organizations using Parquet in production, including Twitter, Cloudera,
>Stripe, and Salesforce (
>http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).
>
>=== Inexperience with Open Source ===
>
>Parquet has existed as a healthy open source project for one year. During
>that time, we have curated an open-source community successfully,
>attracting over 40 contributors (see
>https://github.com/Parquet/parquet-mr/graphs/contributors) from a diverse
>group of companies.
>Several of the core contributors to the project are deeply familiar with
>OSS and Apache specifically: Julien Le Dem was until recently the PMC Chair
>for Apache Pig, and Dmitriy Ryaboy, Aniket Mokashi, and Jonathan Coveney
>are also Apache Pig committers with contributions to several other Apache
>projects. Todd Lipcon and Tom White are committers to Apache Hadoop and
>multiple other related projects. Brock Noland is a Hive committer.
>
>=== Homogenous Developers ===
>
>The initial committers come from a number of companies and countries.
>Parquet has an active community of developers, and we are committed to
>recruiting additional committers based on their contributions to the
>project. The Java library component alone has contributions from 31
>individual GitHub accounts, 14 of which contributed over 1000 lines of
>code.
>
>=== Reliance on Salaried Developers ===
>
>It is expected that Parquet development will occur on both salaried time
>and on volunteer time, after hours. The majority of initial committers are
>paid by their employers to contribute to this project. However, they are
>all passionate about the project, and we are confident that the project
>will continue even if no salaried developers contribute to the project. As
>evidence of this statement, we present the GitHub punchcard (see
>https://github.com/Parquet/parquet-mr/graphs/punch-card) showing that a lot
>of activity happens on weekends. We are committed to recruiting additional
>committers including non-salaried developers.
>
>=== Relationships with Other Apache Products ===
>
>As mentioned in the Alignment section, Parquet is closely related to
>Hadoop. It provides an API that allows it to be easily integrated with
>many other Apache projects: Pig, Hive, Avro, Thrift, Spark, Drill, Crunch,
>and Tajo. Some of the features it provides are similar to the ORC file
>format, which is part of the Hive project. However, Parquet has focused on
>being framework-agnostic and language-independent and has been really
>successful to that end. On top of the Apache projects mentioned above,
>Parquet is also integrated with other open source projects, including
>Protocol Buffers, Cloudera Impala, and Scrooge. We look forward to
>continuing to collaborate with those communities, as well as other Apache
>communities.
>
>=== An Excessive Fascination with the Apache Brand ===
>
>Parquet is already a healthy and well-known open source project. This
>proposal is not for the purpose of generating publicity. Rather, the
>primary benefits to joining Apache are those outlined in the Rationale
>section.
>
>== Documentation ==
>
>Documentation currently lives in README Markdown files:
>
> * https://github.com/Parquet/parquet-format
> * https://github.com/Parquet/parquet-mr
>
>== Source and Intellectual Property Submission Plan ==
>
>The Parquet codebase is currently hosted on GitHub:
>https://github.com/Parquet.
>
>These are the codebases that we would migrate to the Apache Software
>Foundation.
>
>== External Dependencies ==
>
>
> * JUnit: EPL
> * Apache Commons: ALv2
> * Apache Thrift: ALv2
> * Apache Maven: ALv2
> * Apache Avro: ALv2
> * Apache Hadoop: ALv2
> * Google Guava: ALv2
> * Google Protobuf: New BSD License
>
>== Cryptography ==
>
>We do not expect Parquet to be a controlled export item due to the use of
>encryption.
>
>== Required Resources ==
>
>=== Mailing lists ===
>
> * private@parquet.incubator.apache.org
> * commits@parquet.incubator.apache.org
> * dev@parquet.incubator.apache.org
>
>== Subversion Directory ==
>
>Git is the preferred source control system:
>
> * git://git.apache.org/parquet-format
> * git://git.apache.org/parquet-mr
>
>== Issue Tracking ==
>
>We'd like to keep using the Git review and issue tracking tools, closing
>pull requests through git commit messages on git.apache.org.
>
>== Initial Committers ==
>
> * Aniket Mokashi <aniket486@gmail.com>
> * Brock Noland <brock@apache.org>
> * Chris Aniszczyk <caniszczyk@gmail.com>
> * Dmitriy Ryaboy <dvryaboy@apache.org>
> * Jake Farrell <jfarrell@apache.org>
> * Jonathan Coveney <jcoveney@gmail.com>
> * Julien Le Dem <julien@apache.org>
> * Lukas Nalezenec <lukas.nalezenec@gmail.com>
> * Marcel Kornacker <marcel@cloudera.com>
> * Mickael Lacour
> * Nong Li <nong@cloudera.com>
> * Remy Pecqueur
> * Ryan Blue <blue@cloudera.com>
> * Tianshuo Deng <dengtianshuo@gmail.com>
> * Tom White <tomwhite@apache.org>
> * Wesley Peck
>
>== Affiliations ==
>
> * Aniket Mokashi - Twitter
> * Brock Noland - Cloudera
> * Chris Aniszczyk - Twitter
> * Dmitriy Ryaboy - Twitter
> * Jake Farrell
> * Jonathan Coveney - Twitter
> * Julien Le Dem - Twitter
> * Lukas Nalezenec
> * Marcel Kornacker - Cloudera
> * Mickael Lacour - Criteo
> * Nong Li - Cloudera
> * Remy Pecqueur - Criteo
> * Ryan Blue - Cloudera
> * Tianshuo Deng - Twitter
> * Tom White - Cloudera
> * Wesley Peck - ARRIS, Inc.
>
>== Sponsors ==
>
>=== Champion ===
>
> * Todd Lipcon
>
>=== Nominated Mentors ===
>
> * Tom White
> * Chris Mattmann
> * Jake Farrell
> * Roman Shaposhnik
>
>=== Sponsoring Entity ===
>
>The Apache Incubator
>
>-- 
>Cheers,
>
>Chris Aniszczyk
>http://aniszczyk.org
>+1 512 961 6719



