Date: Sat, 17 May 2014 10:26:12 -0700
Subject: Re: [PROPOSAL] Parquet
From: Chris Aniszczyk
To: general@incubator.apache.org

Your request about the user list seems fine, no need to have multiple lists
atm IMHO.

The proposal has been updated accordingly, thanks!
https://wiki.apache.org/incubator/ParquetProposal?action=recall&rev=21

On Sat, May 17, 2014 at 9:33 AM, Henry Saputra wrote:

> Chris, could you please address my concern about user@ list
>
> - Henry
>
> On Fri, May 16, 2014 at 4:43 PM, Chris Aniszczyk wrote:
> > SGTM Roman, thanks for volunteering!
> >
> > I'll start the vote on Sunday barring any issues.
> >
> >
> > On Fri, May 16, 2014 at 11:56 AM, Roman Shaposhnik wrote:
> >
> >> Hi!
> >>
> >> proposal looks good to me and I am very much looking
> >> for a voting thread.
> >>
> >> One small request, since I plan to spend a fair amount
> >> of time on Parquet anyway, would you guys be ok
> >> with adding me as an extra mentor so I can help
> >> with that aspect of the project as well?
> >>
> >> Thanks,
> >> Roman.
> >>
> >> P.S. Plus it has an added benefit of increasing diversity
> >> of affiliations from the get go.
> >>
> >> On Mon, May 12, 2014 at 10:02 AM, Chris Aniszczyk wrote:
> >> > We would like to propose Parquet as an Apache Incubator project.
> >> > https://wiki.apache.org/incubator/ParquetProposal
> >> >
> >> > Feel free to comment, we'll go for a vote in a week or two or whenever
> >> > consensus has been reached on the proposal.
> >> >
> >> > I've posted the text of the proposal below:
> >> >
> >> > == Abstract ==
> >> > Parquet is a columnar storage format for Hadoop.
> >> >
> >> > == Proposal ==
> >> >
> >> > We created Parquet to make the advantages of compressed, efficient
> >> > columnar data representation available to any project in the Hadoop
> >> > ecosystem, regardless of the choice of data processing framework, data
> >> > model, or programming language.
> >> >
> >> > == Background ==
> >> >
> >> > Parquet is built from the ground up with complex nested data structures
> >> > in mind, and uses the repetition/definition level approach to encoding
> >> > such data structures, as popularized by Google Dremel (
> >> > https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We
> >> > believe this approach is superior to simple flattening of nested name
> >> > spaces.
> >> >
> >> > Parquet is built to support very efficient compression and encoding
> >> > schemes. Parquet allows compression schemes to be specified on a
> >> > per-column level, and is future-proofed to allow adding more encodings
> >> > as they are invented and implemented. We separate the concepts of
> >> > encoding and compression, allowing Parquet consumers to implement
> >> > operators that work directly on encoded data without paying a
> >> > decompression and decoding penalty when possible.
> >> >
> >> > == Rationale ==
> >> >
> >> > Parquet is built to be used by anyone. We believe that an efficient,
> >> > well-implemented columnar storage substrate should be useful to all
> >> > frameworks without the cost of extensive and difficult to set up
> >> > dependencies.
> >> >
> >> > Furthermore, the rapid growth of the Parquet community is empowered by
> >> > open source.
> >> > We believe the Apache foundation is a great fit as the long-term
> >> > home for Parquet, as it provides an established process for
> >> > community-driven development and decision making by consensus. This is
> >> > exactly the model we want for future Parquet development.
> >> >
> >> > == Initial Goals ==
> >> >
> >> > * Move the existing codebase to Apache
> >> > * Integrate with the Apache development process
> >> > * Ensure all dependencies are compliant with Apache License version 2.0
> >> > * Incremental development and releases per Apache guidelines
> >> >
> >> > == Current Status ==
> >> >
> >> > Parquet has undergone 2 major releases:
> >> > https://github.com/Parquet/parquet-format/releases of the core format
> >> > and 22 releases: https://github.com/Parquet/parquet-mr/releases of the
> >> > supporting set of Java libraries.
> >> >
> >> > The Parquet source is currently hosted at GitHub, which will seed the
> >> > Apache git repository.
> >> >
> >> > === Meritocracy ===
> >> >
> >> > We plan to invest in supporting a meritocracy. We will discuss the
> >> > requirements in an open forum. Several companies have already expressed
> >> > interest in this project, and we intend to invite additional developers
> >> > to participate. We will encourage and monitor community participation
> >> > so that privileges can be extended to those that contribute.
> >> >
> >> > === Community ===
> >> >
> >> > There is a large need for an advanced columnar storage format for
> >> > Hadoop.
> >> > Parquet is being used in production by many organizations (see
> >> > https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md)
> >> >
> >> > * Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
> >> > * Criteo: https://twitter.com/julsimon/statuses/312114074911666177
> >> > * Salesforce: https://twitter.com/TwitterOSS/statuses/392734610116726784
> >> > * Stripe: https://twitter.com/avibryant/statuses/391339949250715648
> >> > * Twitter: https://twitter.com/J_/statuses/315844725611581441
> >> >
> >> > By bringing Parquet into Apache, we believe that the community will
> >> > grow even bigger.
> >> >
> >> > === Core Developers ===
> >> >
> >> > Parquet was initially developed as a collaboration between Twitter,
> >> > Cloudera and Criteo.
> >> >
> >> > See
> >> > https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
> >> >
> >> > === Alignment ===
> >> >
> >> > We believe that having Parquet at Apache will help further the growth
> >> > of the big-data community, as it will encourage cooperation within the
> >> > greater ecosystem of projects spawned by Apache Hadoop. The alignment
> >> > is also beneficial to other Apache communities (such as Hadoop, Hive,
> >> > Avro).
> >> >
> >> > == Known Risks ==
> >> >
> >> > === Orphaned Products ===
> >> >
> >> > The risk of the Parquet project being abandoned is minimal. There are
> >> > many organizations using Parquet in production, including Twitter,
> >> > Cloudera, Stripe, and Salesforce (
> >> > http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).
> >> >
> >> > === Inexperience with Open Source ===
> >> >
> >> > Parquet has existed as a healthy open source for one year. During that
> >> > time, we have curated an open-source community successfully, attracting
> >> > over 40 contributors (see
> >> > https://github.com/Parquet/parquet-mr/graphs/contributors) from a
> >> > diverse group of companies.
> >> > Several of the core contributors to the project are deeply familiar
> >> > with OSS and Apache specifically: Julien Le Dem is the current PMC
> >> > Chair for Apache Pig, and Dmitriy Ryaboy, Aniket Mokashi, and Jonathan
> >> > Coveney are also Apache Pig committers with contributions to several
> >> > other Apache projects. Todd Lipcon and Tom White are committers to
> >> > Apache Hadoop and multiple other related projects. Brock Noland is a
> >> > Hive committer.
> >> >
> >> > === Homogenous Developers ===
> >> >
> >> > The initial committers come from a number of companies and countries.
> >> > Parquet has an active community of developers, and we are committed to
> >> > recruiting additional committers based on their contributions to the
> >> > project. The Java library component alone has contributions from 31
> >> > individual GitHub accounts, 14 of which contributed over 1000 lines of
> >> > code.
> >> >
> >> > === Reliance on Salaried Developers ===
> >> >
> >> > It is expected that Parquet development will occur on both salaried
> >> > time and on volunteer time, after hours. The majority of initial
> >> > committers are paid by their employers to contribute to this project.
> >> > However, they are all passionate about the project, and we are
> >> > confident that the project will continue even if no salaried developers
> >> > contribute to the project. As evidence of this statement, we present
> >> > the GitHub punchcard (see
> >> > https://github.com/Parquet/parquet-mr/graphs/punch-card) showing that a
> >> > lot of activity happens on weekends. We are committed to recruiting
> >> > additional committers including non-salaried developers.
> >> >
> >> > === Relationships with Other Apache Products ===
> >> >
> >> > As mentioned in the Alignment section, Parquet is closely related to
> >> > Hadoop, Pig, Avro, Thrift, YARN and Mesos in numerous ways.
> >> > We look forward to collaborating with those communities, as well as
> >> > other Apache communities (including Apache S4 which focuses on stateful
> >> > low-latency processing).
> >> >
> >> > === An Excessive Fascination with the Apache Brand ===
> >> >
> >> > Parquet is an already healthy and well-known open source project. This
> >> > proposal is not for the purpose of generating publicity. Rather, the
> >> > primary benefits to joining Apache are those outlined in the Rationale
> >> > section.
> >> >
> >> > == Documentation ==
> >> >
> >> > Documentation is currently located in README markdown files:
> >> >
> >> > * https://github.com/Parquet/parquet-format
> >> > * https://github.com/Parquet/parquet-mr
> >> >
> >> > == Source and Intellectual Property Submission Plan ==
> >> >
> >> > The Parquet codebase is currently hosted on GitHub:
> >> > https://github.com/Parquet.
> >> >
> >> > This is the exact codebase that we would migrate to the Apache
> >> > foundation.
> >> >
> >> > == External Dependencies ==
> >> >
> >> > * JUnit: EPL
> >> > * Apache Commons: ALv2
> >> > * Apache Thrift: ALv2
> >> > * Apache Maven: ALv2
> >> > * Apache Avro: ALv2
> >> > * Apache Hadoop: ALv2
> >> > * Google Guava: ALv2
> >> >
> >> > == Cryptography ==
> >> >
> >> > We do not expect Parquet to be a controlled export item due to the use
> >> > of encryption.
> >> >
> >> > == Required Resources ==
> >> >
> >> > === Mailing lists ===
> >> >
> >> > * parquet-dev
> >> > * parquet-user
> >> >
> >> > == Subversion Directory ==
> >> >
> >> > Git is the preferred source control system:
> >> > git://git.apache.org/parquet
> >> >
> >> > == Issue Tracking ==
> >> >
> >> > JIRA: Parquet (PARQUET)
> >> >
> >> > == Initial Committers ==
> >> >
> >> > * Aniket Mokashi
> >> > * Brock Noland
> >> > * Chris Aniszczyk
> >> > * Dmitriy Ryaboy
> >> > * Jake Farrell
> >> > * Julien Le Dem
> >> > * Lukas Nalezenec
> >> > * Marcel Kornacker
> >> > * Mickael Lacour
> >> > * Nong Li
> >> > * Remy Pecqueur
> >> > * Tianshuo Deng
> >> > * Tom White
> >> >
> >> > == Affiliations ==
> >> >
> >> > * Aniket Mokashi - Twitter
> >> > * Brock Noland - Cloudera
> >> > * Chris Aniszczyk - Twitter
> >> > * Dmitriy Ryaboy - Twitter
> >> > * Jake Farrell
> >> > * Julien Le Dem - Twitter
> >> > * Lukas Nalezenec
> >> > * Marcel Kornacker - Cloudera
> >> > * Mickael Lacour - Criteo
> >> > * Nong Li - Cloudera
> >> > * Remy Pecqueur - Criteo
> >> > * Tianshuo Deng - Twitter
> >> > * Tom White - Cloudera
> >> >
> >> > == Sponsors ==
> >> >
> >> > === Champion ===
> >> >
> >> > * Todd Lipcon
> >> >
> >> > === Nominated Mentors ===
> >> >
> >> > * Tom White
> >> > * Chris Mattmann
> >> > * Jake Farrell
> >> >
> >> > === Sponsoring Entity ===
> >> >
> >> > The Apache Incubator
> >> >
> >> > --
> >> > Cheers,
> >> >
> >> > Chris Aniszczyk
> >> > http://aniszczyk.org
> >> > +1 512 961 6719
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> >> For additional commands, e-mail: general-help@incubator.apache.org

--
Cheers,

Chris Aniszczyk
http://aniszczyk.org
+1 512 961 6719