From: Ryan Blue
Date: Wed, 14 Nov 2018 10:05:01 -0800
Subject: Re: [VOTE] Accept the Iceberg project for incubation
To: Apache Incubator
Reply-To: general@incubator.apache.org

Quick update: James Taylor has offered to mentor the project as well, so
I've added him to the list. Thanks, James!

On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue wrote:

> The discuss thread seems to have reached consensus, so I propose accepting
> the Iceberg project for incubation.
>
> The proposal is copied below and in the wiki:
> https://wiki.apache.org/incubator/IcebergProposal
>
> Please vote on whether to accept Iceberg in the next 72 hours:
>
> [ ] +1, accept Iceberg for incubation
> [ ] -1, reject the Iceberg proposal because . . .
>
> Thank you for reviewing the proposal and voting,
>
> rb
> ------------------------------
> Iceberg Proposal
>
> Abstract
>
> Iceberg is a table format for large, slow-moving tabular data.
>
> It is designed to improve on the de-facto standard table layout built into
> Apache Hive, Presto, and Apache Spark.
>
> Proposal
>
> The purpose of Iceberg is to provide SQL-like tables that are backed by
> large sets of data files. Iceberg is similar to the Hive table layout, the
> de-facto standard structure used to track files in a table, but provides
> additional guarantees and performance optimizations:
>
> - Atomicity - Each change to the table will be complete or will
>   fail. “Do or do not. There is no try.”
> - Snapshot isolation - Reads use one and only one snapshot of a table
>   at some time without holding a lock.
> - Safe schema evolution - A table’s schema can change in well-defined
>   ways, without breaking older data files.
> - Column projection - An engine may request a subset of the available
>   columns, including nested fields.
> - Predicate pushdown - An engine can push filters into read planning
>   to improve performance using partition data and file-level statistics.
>
> Iceberg does NOT define a new file format. All data is stored in Apache
> Avro, Apache ORC, or Apache Parquet files.
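To make the predicate pushdown item above concrete, here is a minimal Python sketch of how read planning might skip data files using partition data and file-level min/max statistics. The `DataFile` class, its fields, and the `plan_files` function are hypothetical names chosen for illustration; they are assumptions, not Iceberg's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical per-file metadata (an assumption, not Iceberg's classes)."""
    path: str
    partition: str  # partition value, e.g. a date string
    min_value: int  # smallest value of the filtered column in this file
    max_value: int  # largest value of the filtered column in this file


def plan_files(files, partition, lower, upper):
    """Return the files that may hold rows with lower <= value <= upper."""
    selected = []
    for f in files:
        if f.partition != partition:
            continue  # partition filtering: wrong partition, skip the file
        if f.max_value < lower or f.min_value > upper:
            continue  # min/max filtering: value ranges cannot overlap
        selected.append(f)
    return selected


files = [
    DataFile("a.parquet", "2018-11-01", 0, 99),
    DataFile("b.parquet", "2018-11-01", 100, 199),
    DataFile("c.parquet", "2018-11-02", 0, 199),
]

# Only files in the requested partition whose range overlaps [150, 160] remain.
matching = plan_files(files, "2018-11-01", 150, 160)
print([f.path for f in matching])
```

The point of the sketch is that files are ruled out from metadata alone, without opening them or listing directories, so only the surviving files need to be read.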
>
> Additionally, Iceberg is designed to work well when data files are stored
> in cloud blob stores, even when those systems provide weaker guarantees
> than a file system, including:
>
> - Eventual consistency in the namespace
> - High latency for directory listings
> - No renames of objects
> - No folder hierarchy
>
> Rationale
>
> Initial benchmarks show dramatic improvements in query planning. For
> example, in Netflix’s Atlas use case, which stores time-series metrics from
> Netflix runtime systems, 1 month of data is stored across 2.7 million files
> in 2,688 partitions:
>
> - Hive table using Parquet:
>    - 400k+ splits, not combined
>    - Explain query: 9.6 minutes wall time (planning only)
> - Iceberg table with partition filtering:
>    - 15,218 splits, combined
>    - Planning: 10 seconds
>    - Query wall time: 13 minutes
> - Iceberg table with partition and min/max filtering:
>    - 412 splits
>    - Planning: 25 seconds
>    - Query wall time: 42 seconds
>
> These performance gains combined with the cross-engine compatibility are a
> very compelling story.
>
> Initial Goals
>
> The initial goal will be to move the existing codebase to Apache and
> integrate with the Apache development process and infrastructure. A primary
> goal of incubation will be to grow and diversify the Iceberg community. We
> are well aware that the project community is composed largely of
> individuals from a single company. We aim to change that during incubation.
>
> Current Status
>
> As previously mentioned, Iceberg is under active development at Netflix,
> and is being used in processing large volumes of data in Amazon EC2.
>
> Iceberg license documentation is already based on Apache guidelines for
> LICENSE and NOTICE content.
>
> Meritocracy
>
> We value meritocracy and we understand that it is the basis for an open
> community that encourages multiple companies and individuals to contribute
> and be invested in the project’s future.
> We will encourage and monitor participation and make sure to extend
> privileges and responsibilities to all contributors.
>
> Community
>
> Iceberg is currently being used by developers at Netflix and a growing
> number of users are actively using it in production environments. Iceberg
> has received contributions from developers working at Hortonworks, WeWork,
> and Palantir. By bringing Iceberg to Apache we aim to assure current and
> future contributors that the Iceberg community is meritocratic and open, in
> order to broaden and diversify the user and developer community.
>
> Core Developers
>
> Iceberg was initially developed at Netflix and is under active
> development. We believe Iceberg will be of interest to a broad range of
> users and developers and that incubating the project at the ASF will help
> us build a diverse, sustainable community.
>
> Alignment
>
> Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive, ORC,
> Parquet, Pig, and Spark. We anticipate integration with additional Apache
> projects as the Iceberg community and interest in the project grow.
>
> Known Risks
>
> Orphaned Products
>
> Netflix is committed to the future development of Iceberg and understands
> that graduation to a TLP, while preferable, is not the only positive
> outcome of incubation.
>
> Should the Iceberg project be accepted by the Incubator, the prospective
> PPMC would be willing to agree to a target incubation period of 2 years or
> less, knowing that every Incubator project incurs a certain cost in terms
> of ASF infrastructure and volunteer time.
>
> Inexperience with Open Source
>
> Three of the initial committers are Apache members and Incubator PMC
> members. They will work with the other community members to teach them the
> Apache Way.
>
> Homogeneous Developers
>
> The majority of the committers work at Netflix, though we are committed to
> recruiting and developing additional committers from a wide spectrum of
> industries and backgrounds.
>
> Reliance on Salaried Developers
>
> It is expected that Iceberg development will occur on both salaried time
> and on volunteer time, after hours. Most of the initial committers are paid
> by Netflix to contribute to this project. However, they are all passionate
> about the project, and we are both confident and hopeful that the project
> will continue even if no salaried developers contribute to the project.
>
> Relationships with Other Apache Products
>
> As mentioned in the Alignment section, Iceberg utilizes a number of
> existing Apache projects (Avro, Hadoop, Hive, ORC, Parquet, Pig, and Spark),
> and we expect that list to expand as the community grows and diversifies.
> Any Apache project in the big data space that needs to store or process
> tabular data would be potentially relevant.
>
> An Excessive Fascination with the Apache Brand
>
> We are applying to the Incubator process because we think it is the next
> logical step for the Iceberg project after open-sourcing the code. This
> proposal is not for the purpose of generating publicity. Rather, we want to
> make sure to create a very inclusive and meritocratic community, outside
> the umbrella of a single company. Netflix has a long history of
> contributing to Apache projects and the Iceberg developers and contributors
> understand the implications of making it an Apache project.
>
> Required Resources
>
> Mailing lists
>
> - dev@iceberg.incubator.apache.org
> - commits@iceberg.incubator.apache.org
> - private@iceberg.incubator.apache.org
>
> The podling may also create a user mailing list, if needed.
>
> Source Control and Issue Tracking
>
> The Iceberg podling would use Apache’s gitbox integration to sync between
> GitHub and Apache infrastructure. The podling would use GitHub issues and
> pull requests for community engagement.
>
> Current Resources
>
> - Initial source: https://github.com/Netflix/iceberg
> - Java documentation:
>   https://netflix.github.io/iceberg/current/javadoc/index.html?com/netflix/iceberg/package-summary.html
> - Table specification:
>   https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit
>
> Source and Intellectual Property Submission Plan
>
> The Iceberg source code in GitHub is currently licensed under Apache
> License v2.0 and the copyright is assigned to Netflix. If Iceberg becomes
> an Incubator project at the ASF, Netflix will transfer the source code and
> trademark ownership to the Apache Software Foundation via a Software Grant
> Agreement.
>
> External Dependencies
>
> External dependencies licensed under Apache License 2.0
>
> - Guava https://github.com/google/guava
> - Jackson https://github.com/FasterXML/jackson-core
> - Joda-Time http://www.joda.org/joda-time/
>
> External dependencies licensed under the MIT License
>
> - SLF4J https://www.slf4j.org/
> - Mockito https://github.com/mockito/mockito
>
> ASF Projects
>
> - Apache Avro
> - Apache Hadoop
> - Apache Hive
> - Apache ORC
> - Apache Parquet
> - Apache Pig
> - Apache Spark
>
> Cryptography
>
> We do not expect Iceberg to be a controlled export item due to the use of
> encryption.
>
> Initial Committers and Affiliations
>
> - Ryan Blue blue@apache.org (Netflix)
> - Parth Brahmbhatt parth@apache.org (Netflix)
> - Julien Le Dem julien@apache.org (WeWork)
> - Owen O’Malley omalley@apache.org (Hortonworks)
> - Daniel Weeks dweeks@apache.org (Netflix)
>
> Sponsors and Nominated Mentors
>
> - Champion and mentor: Owen O’Malley omalley@apache.org
> - Mentor: Ryan Blue blue@apache.org
> - Mentor: Julien Le Dem julien@apache.org
>
> Sponsoring Entity
>
> The Apache Incubator
> --
> Ryan Blue
>

--
Ryan Blue