incubator-general mailing list archives

From Timothy Chen <tnac...@gmail.com>
Subject Re: [VOTE] Accept Parquet into the incubator
Date Mon, 19 May 2014 01:28:09 GMT
+1 non-binding.

Tim


> On May 18, 2014, at 6:14 PM, Jake Farrell <jfarrell@apache.org> wrote:
> 
> +1 (binding)
> 
> -Jake
> 
> 
> 
> On Sun, May 18, 2014 at 5:15 PM, Chris Aniszczyk <caniszczyk@gmail.com> wrote:
> 
>> Based on the results of the discussion thread:
>> 
>> http://mail-archives.apache.org/mod_mbox/incubator-general/201405.mbox/%3CCAJg1wMRGhLu4P7LeVQB%2B5K0C-fr-pw2448uj%3D6-3zHag4F1EbA%40mail.gmail.com%3E
>> 
>> I would like to call a vote on accepting Parquet into the incubator.
>> https://wiki.apache.org/incubator/ParquetProposal
>> 
>> [ ] +1 Accept Parquet into the Incubator
>> [ ] +0 Indifferent to the acceptance of Parquet
>> [ ] -1 Do not accept Parquet because ...
>> 
>> The vote will be open until Thursday May 22nd 18:00 UTC.
>> 
>> = Parquet Proposal =
>> 
>> == Abstract ==
>> Parquet is a columnar storage format for Hadoop.
>> 
>> == Proposal ==
>> 
>> We created Parquet to make the advantages of compressed, efficient columnar
>> data representation available to any project in the Hadoop ecosystem,
>> regardless of the choice of data processing framework, data model, or
>> programming language.
>> 
>> == Background ==
>> 
>> Parquet is built from the ground up with complex nested data structures in
>> mind, and uses the repetition/definition level approach to encoding such
>> data structures, as popularized by Google Dremel (
>> https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We believe
>> this approach is superior to simple flattening of nested name spaces.
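
For readers unfamiliar with the repetition/definition level approach mentioned above, here is a minimal Python sketch of the idea. It is not Parquet's implementation: it assumes a hypothetical schema with a single top-level repeated string field, so both levels have a maximum of 1, and the function names `shred` and `assemble` are made up for illustration.

```python
def shred(records):
    """Flatten a list of string lists into (repetition, definition, value) triples."""
    triples = []
    for tags in records:
        if not tags:
            # Empty list: one placeholder entry with definition level 0 (no value present).
            triples.append((0, 0, None))
            continue
        for i, tag in enumerate(tags):
            # Repetition level 0 starts a new record; 1 continues the current list.
            triples.append((0 if i == 0 else 1, 1, tag))
    return triples


def assemble(triples):
    """Rebuild the original records from the flat column of triples."""
    records = []
    for repetition, definition, value in triples:
        if repetition == 0:
            records.append([])
        if definition == 1:
            records[-1].append(value)
    return records


records = [["a", "b"], [], ["c"]]
column = shred(records)
print(column)
print(assemble(column) == records)
```

The point of the sketch is that the nested structure survives in two small integer streams next to the values, so the column can be stored flat and still be reassembled without loss.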
>> 
>> Parquet is built to support very efficient compression and encoding
>> schemes. Parquet allows compression schemes to be specified on a per-column
>> level, and is future-proofed to allow adding more encodings as they are
>> invented and implemented. We separate the concepts of encoding and
>> compression, allowing parquet consumers to implement operators that work
>> directly on encoded data without paying decompression and decoding penalty
>> when possible.
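
To illustrate why keeping encoding separate from compression can help, the following Python sketch dictionary-encodes a column, compresses the encoded bytes as an independent step, and then answers an equality count on the integer codes without decoding them back to strings. This is an assumption-laden illustration, not Parquet's API: the helper names are hypothetical, `zlib` stands in for whatever codec a column might use, and the one-byte-per-code layout assumes fewer than 256 distinct values.

```python
import zlib


def dict_encode(values):
    """Encoding step: replace each value with a small integer code."""
    dictionary, index, codes = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def compress_codes(codes):
    """Compression step, independent of the encoding (assumes codes < 256)."""
    return zlib.compress(bytes(codes))


def decompress_codes(blob):
    return list(zlib.decompress(blob))


def count_equal(dictionary, codes, target):
    """Count matches by comparing integer codes; values are never decoded."""
    if target not in dictionary:
        return 0
    code = dictionary.index(target)
    return sum(1 for c in codes if c == code)


dictionary, codes = dict_encode(["us", "fr", "us", "de", "us"])
blob = compress_codes(codes)
print(count_equal(dictionary, decompress_codes(blob), "us"))
```

In this sketch the target is looked up in the dictionary once, and the scan then touches only integers, which is the kind of shortcut the paragraph above describes.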
>> 
>> == Rationale ==
>> 
>> Parquet is built to be used by anyone. We believe that an efficient,
>> well-implemented columnar storage substrate should be useful to all
>> frameworks without the cost of extensive and difficult to set up
>> dependencies.
>> 
>> Furthermore, the rapid growth of the Parquet community is empowered by open
>> source. We believe the Apache foundation is a great fit as the long-term
>> home for Parquet, as it provides an established process for
>> community-driven development and decision making by consensus. This is
>> exactly the model we want for future Parquet development.
>> 
>> == Initial Goals ==
>> 
>> * Move the existing codebase to Apache
>> * Integrate with the Apache development process
>> * Ensure all dependencies are compliant with Apache License version 2.0
>> * Incremental development and releases per Apache guidelines
>> 
>> == Current Status ==
>> 
>> Parquet has undergone 2 major releases of the core format
>> (https://github.com/Parquet/parquet-format/releases) and 22 releases of the
>> supporting set of Java libraries
>> (https://github.com/Parquet/parquet-mr/releases).
>> 
>> The Parquet source is currently hosted at GitHub, which will seed the
>> Apache git repository.
>> 
>> === Meritocracy ===
>> 
>> We plan to invest in supporting a meritocracy. We will discuss the
>> requirements in an open forum. Several companies have already expressed
>> interest in this project, and we intend to invite additional developers to
>> participate. We will encourage and monitor community participation so that
>> privileges can be extended to those that contribute.
>> 
>> === Community ===
>> 
>> There is a large need for an advanced columnar storage format for Hadoop.
>> Parquet is being used in production by many organizations (see
>> https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md)
>> 
>> * Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
>> * Criteo: https://twitter.com/julsimon/statuses/312114074911666177
>> * Salesforce: https://twitter.com/TwitterOSS/statuses/392734610116726784
>> * Stripe: https://twitter.com/avibryant/statuses/391339949250715648
>> * Twitter: https://twitter.com/J_/statuses/315844725611581441
>> 
>> By bringing Parquet into Apache, we believe that the community will grow
>> even bigger.
>> 
>> === Core Developers ===
>> 
>> Parquet was initially developed as a collaboration between Twitter,
>> Cloudera and Criteo.
>> 
>> See
>> 
>> https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
>> 
>> === Alignment ===
>> 
>> We believe that having Parquet at Apache will help further the growth of
>> the big-data community, as it will encourage cooperation within the greater
>> ecosystem of projects spawned by Apache Hadoop. The alignment is also
>> beneficial to other Apache communities (such as Hadoop, Hive, Avro).
>> 
>> == Known Risks ==
>> 
>> === Orphaned Products ===
>> 
>> The risk of the Parquet project being abandoned is minimal. There are many
>> organizations using Parquet in production, including Twitter, Cloudera,
>> Stripe, and Salesforce (
>> http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).
>> 
>> === Inexperience with Open Source ===
>> 
>> Parquet has existed as a healthy open source project for one year. During that
>> time, we have curated an open-source community successfully, attracting
>> over 40 contributors (see
>> https://github.com/Parquet/parquet-mr/graphs/contributors) from a diverse
>> group of companies.
>> Several of the core contributors to the project are deeply familiar with
>> OSS and Apache specifically: Julien Le Dem was until recently the PMC Chair
>> for Apache Pig, and Dmitriy Ryaboy, Aniket Mokashi, and Jonathan Coveney
>> are also Apache Pig committers with contributions to several other Apache
>> projects. Todd Lipcon and Tom White are committers to Apache Hadoop and
>> multiple other related projects. Brock Noland is a Hive committer.
>> 
>> === Homogeneous Developers ===
>> 
>> The initial committers come from a number of companies and countries.
>> Parquet has an active community of developers, and we are committed to
>> recruiting additional committers based on their contributions to the
>> project. The Java library component alone has contributions from 31
>> individual GitHub accounts, 14 of which contributed over 1000 lines of
>> code.
>> 
>> === Reliance on Salaried Developers ===
>> 
>> It is expected that Parquet development will occur on both salaried time
>> and on volunteer time, after hours. The majority of initial committers are
>> paid by their employers to contribute to this project. However, they are
>> all passionate about the project, and we are confident that the project
>> will continue even if no salaried developers contribute to the project. As
>> evidence of this statement, we present the GitHub punchcard (see
>> https://github.com/Parquet/parquet-mr/graphs/punch-card) showing that a lot
>> of activity happens on weekends. We are committed to recruiting additional
>> committers including non-salaried developers.
>> 
>> === Relationships with Other Apache Products ===
>> 
>> As mentioned in the Alignment section, Parquet is closely related to
>> Hadoop. It provides an API that has allowed it to be easily integrated with
>> many other Apache projects: Pig, Hive, Avro, Thrift, Spark, Drill, Crunch,
>> and Tajo. Some of the features it provides are similar to the ORC file format,
>> which is part of the Hive project. However, Parquet focuses on being
>> framework agnostic and language independent, and has been really successful
>> to that end. On top of the Apache projects mentioned above, Parquet is also
>> integrated with other open source projects, including Protocol Buffers,
>> Cloudera Impala, and Scrooge. We look forward to continuing to collaborate with
>> those communities, as well as other Apache communities.
>> 
>> === An Excessive Fascination with the Apache Brand ===
>> 
>> Parquet is an already healthy and well known open source project. This
>> proposal is not for the purpose of generating publicity. Rather, the
>> primary benefits to joining Apache are those outlined in the Rationale
>> section.
>> 
>> == Documentation ==
>> 
>> Documentation is currently located as README markdown files:
>> 
>> * https://github.com/Parquet/parquet-format
>> * https://github.com/Parquet/parquet-mr
>> 
>> == Source and Intellectual Property Submission Plan ==
>> 
>> The Parquet codebase is currently hosted on GitHub:
>> https://github.com/Parquet.
>> 
>> These are the codebases that we would migrate to the Apache foundation.
>> 
>> == External Dependencies ==
>> 
>> 
>> * JUnit: EPL
>> * Apache Commons: ALv2
>> * Apache Thrift: ALv2
>> * Apache Maven: ALv2
>> * Apache Avro: ALv2
>> * Apache Hadoop: ALv2
>> * Google Guava: ALv2
>> * Google Protobuf: New BSD License
>> 
>> == Cryptography ==
>> 
>> We do not expect Parquet to be a controlled export item due to the use of
>> encryption.
>> 
>> == Required Resources ==
>> 
>> === Mailing lists ===
>> 
>> * private@parquet.incubator.apache.org
>> * commits@parquet.incubator.apache.org
>> * dev@parquet.incubator.apache.org
>> 
>> == Subversion Directory ==
>> 
>> Git is the preferred source control system:
>> 
>> * git://git.apache.org/parquet-format
>> * git://git.apache.org/parquet-mr
>> 
>> == Issue Tracking ==
>> 
>> We'd like to keep using the Git review and issue tracking tools, and to
>> control the closing of pull requests through git commit messages in
>> git.apache.org.
>> 
>> == Initial Committers ==
>> 
>> * Aniket Mokashi <aniket486@gmail.com>
>> * Brock Noland <brock@apache.org>
>> * Chris Aniszczyk <caniszczyk@gmail.com>
>> * Dmitriy Ryaboy <dvryaboy@apache.org>
>> * Jake Farrell <jfarrell@apache.org>
>> * Jonathan Coveney <jcoveney@gmail.com>
>> * Julien Le Dem <julien@apache.org>
>> * Lukas Nalezenec <lukas.nalezenec@gmail.com>
>> * Marcel Kornacker <marcel@cloudera.com>
>> * Mickael Lacour
>> * Nong Li <nong@cloudera.com>
>> * Remy Pecqueur
>> * Ryan Blue <blue@cloudera.com>
>> * Tianshuo Deng <dengtianshuo@gmail.com>
>> * Tom White <tomwhite@apache.org>
>> * Wesley Peck
>> 
>> == Affiliations ==
>> 
>> * Aniket Mokashi - Twitter
>> * Brock Noland - Cloudera
>> * Chris Aniszczyk - Twitter
>> * Dmitriy Ryaboy - Twitter
>> * Jake Farrell
>> * Jonathan Coveney - Twitter
>> * Julien Le Dem - Twitter
>> * Lukas Nalezenec
>> * Marcel Kornacker - Cloudera
>> * Mickael Lacour - Criteo
>> * Nong Li - Cloudera
>> * Remy Pecqueur - Criteo
>> * Ryan Blue - Cloudera
>> * Tianshuo Deng - Twitter
>> * Tom White - Cloudera
>> * Wesley Peck - ARRIS, Inc.
>> 
>> == Sponsors ==
>> 
>> === Champion ===
>> 
>> * Todd Lipcon
>> 
>> === Nominated Mentors ===
>> 
>> * Tom White
>> * Chris Mattmann
>> * Jake Farrell
>> * Roman Shaposhnik
>> 
>> === Sponsoring Entity ===
>> 
>> The Apache Incubator
>> 
>> --
>> Cheers,
>> 
>> Chris Aniszczyk
>> http://aniszczyk.org
>> +1 512 961 6719
>> 
