From: Todd Lipcon
Date: Wed, 14 May 2014 09:59:52 -0700
Subject: Re: [PROPOSAL] Parquet
To: general@incubator.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=14dae9473a67232a4b04f95f1d96

--14dae9473a67232a4b04f95f1d96
Content-Type: text/plain; charset=UTF-8

Proposal looks good to me! Eagerly awaiting the vote.

-Todd


On Mon, May 12, 2014 at 10:02 AM, Chris Aniszczyk wrote:

> We would like to propose Parquet as an Apache Incubator project.
> https://wiki.apache.org/incubator/ParquetProposal
>
> Feel free to comment; we'll go for a vote in a week or two or whenever
> consensus has been reached on the proposal.
>
> I've posted the text of the proposal below:
>
> == Abstract ==
> Parquet is a columnar storage format for Hadoop.
>
> == Proposal ==
>
> We created Parquet to make the advantages of compressed, efficient columnar
> data representation available to any project in the Hadoop ecosystem,
> regardless of the choice of data processing framework, data model, or
> programming language.
>
> == Background ==
>
> Parquet is built from the ground up with complex nested data structures in
> mind, and uses the repetition/definition level approach to encoding such
> data structures, as popularized by Google Dremel
> (https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We believe
> this approach is superior to simple flattening of nested name spaces.
>
> Parquet is built to support very efficient compression and encoding
> schemes.
> Parquet allows compression schemes to be specified on a per-column
> level, and is future-proofed to allow adding more encodings as they are
> invented and implemented. We separate the concepts of encoding and
> compression, allowing Parquet consumers to implement operators that work
> directly on encoded data without paying a decompression and decoding
> penalty when possible.
>
> == Rationale ==
>
> Parquet is built to be used by anyone. We believe that an efficient,
> well-implemented columnar storage substrate should be useful to all
> frameworks without the cost of extensive and difficult-to-set-up
> dependencies.
>
> Furthermore, the rapid growth of the Parquet community is empowered by open
> source. We believe the Apache foundation is a great fit as the long-term
> home for Parquet, as it provides an established process for
> community-driven development and decision making by consensus. This is
> exactly the model we want for future Parquet development.
>
> == Initial Goals ==
>
> * Move the existing codebase to Apache
> * Integrate with the Apache development process
> * Ensure all dependencies are compliant with Apache License version 2.0
> * Incremental development and releases per Apache guidelines
>
> == Current Status ==
>
> Parquet has undergone 2 major releases of the core format
> (https://github.com/Parquet/parquet-format/releases) and 22 releases of the
> supporting set of Java libraries
> (https://github.com/Parquet/parquet-mr/releases).
>
> The Parquet source is currently hosted at GitHub, which will seed the
> Apache git repository.
>
> === Meritocracy ===
>
> We plan to invest in supporting a meritocracy. We will discuss the
> requirements in an open forum. Several companies have already expressed
> interest in this project, and we intend to invite additional developers to
> participate. We will encourage and monitor community participation so that
> privileges can be extended to those who contribute.
>
> === Community ===
>
> There is a strong need for an advanced columnar storage format for Hadoop.
> Parquet is being used in production by many organizations (see
> https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md):
>
> * Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
> * Criteo: https://twitter.com/julsimon/statuses/312114074911666177
> * Salesforce: https://twitter.com/TwitterOSS/statuses/392734610116726784
> * Stripe: https://twitter.com/avibryant/statuses/391339949250715648
> * Twitter: https://twitter.com/J_/statuses/315844725611581441
>
> By bringing Parquet into Apache, we believe that the community will grow
> even larger.
>
> === Core Developers ===
>
> Parquet was initially developed as a collaboration between Twitter,
> Cloudera and Criteo.
>
> See
> https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
>
> === Alignment ===
>
> We believe that having Parquet at Apache will help further the growth of
> the big-data community, as it will encourage cooperation within the greater
> ecosystem of projects spawned by Apache Hadoop. The alignment is also
> beneficial to other Apache communities (such as Hadoop, Hive, Avro).
>
> == Known Risks ==
>
> === Orphaned Products ===
>
> The risk of the Parquet project being abandoned is minimal. There are many
> organizations using Parquet in production, including Twitter, Cloudera,
> Stripe, and Salesforce
> (http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).
>
> === Inexperience with Open Source ===
>
> Parquet has existed as a healthy open source project for one year. During
> that time, we have successfully curated an open-source community,
> attracting over 40 contributors (see
> https://github.com/Parquet/parquet-mr/graphs/contributors) from a diverse
> group of companies.
> Several of the core contributors to the project are deeply familiar with
> OSS and Apache specifically: Julien Le Dem is the current PMC Chair for
> Apache Pig, and Dmitriy Ryaboy, Aniket Mokashi, and Jonathan Coveney are
> also Apache Pig committers with contributions to several other Apache
> projects. Todd Lipcon and Tom White are committers to Apache Hadoop and
> multiple other related projects. Brock Noland is a Hive committer.
>
> === Homogenous Developers ===
>
> The initial committers come from a number of companies and countries.
> Parquet has an active community of developers, and we are committed to
> recruiting additional committers based on their contributions to the
> project. The Java library component alone has contributions from 31
> individual GitHub accounts, 14 of which contributed over 1000 lines of
> code.
>
> === Reliance on Salaried Developers ===
>
> It is expected that Parquet development will occur on both salaried time
> and on volunteer time, after hours. The majority of initial committers are
> paid by their employers to contribute to this project. However, they are
> all passionate about the project, and we are confident that the project
> will continue even if no salaried developers contribute to it. As
> evidence of this statement, we present the GitHub punchcard (see
> https://github.com/Parquet/parquet-mr/graphs/punch-card) showing that a
> lot of activity happens on weekends. We are committed to recruiting
> additional committers, including non-salaried developers.
>
> === Relationships with Other Apache Products ===
>
> As mentioned in the Alignment section, Parquet is closely related to
> Hadoop, Pig, Avro, Thrift, YARN and Mesos in numerous ways. We look
> forward to collaborating with those communities, as well as other Apache
> communities (including Apache S4, which focuses on stateful low-latency
> processing).
>
> === An Excessive Fascination with the Apache Brand ===
>
> Parquet is an already healthy and well-known open source project. This
> proposal is not for the purpose of generating publicity. Rather, the
> primary benefits to joining Apache are those outlined in the Rationale
> section.
>
> == Documentation ==
>
> Documentation currently lives in README Markdown files:
>
> * https://github.com/Parquet/parquet-format
> * https://github.com/Parquet/parquet-mr
>
> == Source and Intellectual Property Submission Plan ==
>
> The Parquet codebase is currently hosted on GitHub:
> https://github.com/Parquet.
>
> This is the exact codebase that we would migrate to the Apache foundation.
>
> == External Dependencies ==
>
> * JUnit: EPL
> * Apache Commons: ALv2
> * Apache Thrift: ALv2
> * Apache Maven: ALv2
> * Apache Avro: ALv2
> * Apache Hadoop: ALv2
> * Google Guava: ALv2
>
> == Cryptography ==
>
> We do not expect Parquet to be a controlled export item due to the use of
> encryption.
>
> == Required Resources ==
>
> === Mailing lists ===
>
> * parquet-dev
> * parquet-user
>
> == Subversion Directory ==
>
> Git is the preferred source control system: git://git.apache.org/parquet
>
> == Issue Tracking ==
>
> JIRA: Parquet (PARQUET)
>
> == Initial Committers ==
>
> * Aniket Mokashi
> * Brock Noland
> * Chris Aniszczyk
> * Dmitriy Ryaboy
> * Jake Farrell
> * Julien Le Dem
> * Lukas Nalezenec
> * Marcel Kornacker
> * Mickael Lacour
> * Nong Li
> * Remy Pecqueur
> * Tianshuo Deng
> * Tom White
>
> == Affiliations ==
>
> * Aniket Mokashi - Twitter
> * Brock Noland - Cloudera
> * Chris Aniszczyk - Twitter
> * Dmitriy Ryaboy - Twitter
> * Jake Farrell
> * Julien Le Dem - Twitter
> * Lukas Nalezenec
> * Marcel Kornacker - Cloudera
> * Mickael Lacour - Criteo
> * Nong Li - Cloudera
> * Remy Pecqueur - Criteo
> * Tianshuo Deng - Twitter
> * Tom White - Cloudera
>
> == Sponsors ==
>
> === Champion ===
>
> * Todd Lipcon
>
> === Nominated Mentors ===
>
> * Tom White
> * Chris Mattmann
> * Jake Farrell
>
> === Sponsoring Entity ===
>
> The Apache Incubator
>
> --
> Cheers,
>
> Chris Aniszczyk
> http://aniszczyk.org
> +1 512 961 6719
>

--
Todd Lipcon
Software Engineer, Cloudera

--14dae9473a67232a4b04f95f1d96--