incubator-general mailing list archives

From Dave Fisher <dave2w...@comcast.net>
Subject Re: [VOTE] Accept the Iceberg project for incubation
Date Tue, 13 Nov 2018 17:12:25 GMT
+1 (binding)

> On Nov 13, 2018, at 9:10 AM, Matt Sicker <boards@gmail.com> wrote:
> 
> +1 binding
> 
> On Tue, 13 Nov 2018 at 11:09, Ryan Blue <blue@apache.org> wrote:
> 
>> +1 (binding)
>> 
>> On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue <blue@apache.org> wrote:
>> 
>>> The discuss thread seems to have reached consensus, so I propose accepting
>>> the Iceberg project for incubation.
>>> 
>>> The proposal is copied below and in the wiki:
>>> https://wiki.apache.org/incubator/IcebergProposal
>>> 
>>> Please vote on whether to accept Iceberg in the next 72 hours:
>>> 
>>> [ ] +1, accept Iceberg for incubation
>>> [ ] -1, reject the Iceberg proposal because . . .
>>> 
>>> Thank you for reviewing the proposal and voting,
>>> 
>>> rb
>>> ------------------------------
>>> Iceberg Proposal Abstract
>>> 
>>> Iceberg is a table format for large, slow-moving tabular data.
>>> 
>>> It is designed to improve on the de-facto standard table layout built into
>>> Apache Hive, Presto, and Apache Spark.
>>> Proposal
>>> 
>>> The purpose of Iceberg is to provide SQL-like tables that are backed by
>>> large sets of data files. Iceberg is similar to the Hive table layout, the
>>> de-facto standard structure used to track files in a table, but provides
>>> additional guarantees and performance optimizations:
>>> 
>>>   - Atomicity - Each change to the table will either complete or fail.
>>>   “Do or do not. There is no try.”
>>>   - Snapshot isolation - Reads use one and only one snapshot of a table
>>>   at some time without holding a lock.
>>>   - Safe schema evolution - A table’s schema can change in well-defined
>>>   ways, without breaking older data files.
>>>   - Column projection - An engine may request a subset of the available
>>>   columns, including nested fields.
>>>   - Predicate pushdown - An engine can push filters into read planning
>>>   to improve performance using partition data and file-level statistics.
>>> 
>>> Iceberg does NOT define a new file format. All data is stored in Apache
>>> Avro, Apache ORC, or Apache Parquet files.
>>> 
>>> Additionally, Iceberg is designed to work well when data files are stored
>>> in cloud blob stores, even when those systems provide weaker guarantees
>>> than a file system, including:
>>> 
>>>   - Eventual consistency in the namespace
>>>   - High latency for directory listings
>>>   - No renames of objects
>>>   - No folder hierarchy
>>> 
>>> Rationale
>>> 
>>> Initial benchmarks show dramatic improvements in query planning. For
>>> example, Netflix’s Atlas use case stores time-series metrics from Netflix
>>> runtime systems, with 1 month of data stored across 2.7 million files in
>>> 2,688 partitions:
>>> 
>>>   - Hive table using Parquet:
>>>      - 400k+ splits, not combined
>>>      - Explain query: 9.6 minutes wall time (planning only)
>>>   - Iceberg table with partition filtering:
>>>      - 15,218 splits, combined
>>>      - Planning: 10 seconds
>>>      - Query wall time: 13 minutes
>>>   - Iceberg table with partition and min/max filtering:
>>>      - 412 splits
>>>      - Planning: 25 seconds
>>>      - Query wall time: 42 seconds
>>> 
>>> These performance gains combined with the cross-engine compatibility are a
>>> very compelling story.
>>> Initial Goals
>>> 
>>> The initial goal will be to move the existing codebase to Apache and
>>> integrate with the Apache development process and infrastructure. A primary
>>> goal of incubation will be to grow and diversify the Iceberg community. We
>>> are well aware that the project community is largely composed of
>>> individuals from a single company. We aim to change that during incubation.
>>> Current Status
>>> 
>>> As previously mentioned, Iceberg is under active development at Netflix,
>>> and is being used in processing large volumes of data in Amazon EC2.
>>> 
>>> Iceberg license documentation is already based on Apache guidelines for
>>> LICENSE and NOTICE content.
>>> Meritocracy
>>> 
>>> We value meritocracy and we understand that it is the basis for an open
>>> community that encourages multiple companies and individuals to contribute
>>> and be invested in the project’s future. We will encourage and monitor
>>> participation and make sure to extend privileges and responsibilities to
>>> all contributors.
>>> Community
>>> 
>>> Iceberg is currently being used by developers at Netflix and a growing
>>> number of users are actively using it in production environments. Iceberg
>>> has received contributions from developers working at Hortonworks, WeWork,
>>> and Palantir. By bringing Iceberg to Apache we aim to assure current and
>>> future contributors that the Iceberg community is meritocratic and open, in
>>> order to broaden and diversify the user and developer community.
>>> Core Developers
>>> 
>>> Iceberg was initially developed at Netflix and is under active
>>> development. We believe Iceberg will be of interest to a broad range of
>>> users and developers and that incubating the project at the ASF will help
>>> us build a diverse, sustainable community.
>>> Alignment
>>> 
>>> Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive, ORC,
>>> Parquet, Pig, and Spark. We anticipate integration with additional Apache
>>> projects as the Iceberg community and interest in the project grows.
>>> Known Risks
>>> Orphaned Products
>>> 
>>> Netflix is committed to the future development of Iceberg and understands
>>> that graduation to a TLP, while preferable, is not the only positive
>>> outcome of incubation.
>>> 
>>> Should the Iceberg project be accepted by the Incubator, the prospective
>>> PPMC would be willing to agree to a target incubation period of 2 years or
>>> less, knowing that every Incubator project incurs a certain cost in terms
>>> of ASF infrastructure and volunteer time.
>>> Inexperience with Open Source
>>> 
>>> Three of the initial committers are Apache members and Incubator PMC
>>> members. They will work with the other community members to teach them the
>>> Apache Way.
>>> Homogenous Developers
>>> 
>>> The majority of the committers work at Netflix, though we are committed to
>>> recruiting and developing additional committers from a wide spectrum of
>>> industries and backgrounds.
>>> Reliance on Salaried Developers
>>> 
>>> It is expected that Iceberg development will occur on both salaried time
>>> and on volunteer time, after hours. Most of the initial committers are paid
>>> by Netflix to contribute to this project. However, they are all passionate
>>> about the project, and we are both confident and hopeful that the project
>>> will continue even if no salaried developers contribute to the project.
>>> Relationships with Other Apache Products
>>> 
>>> As mentioned in the Alignment section, Iceberg utilizes a number of
>>> existing Apache projects (Avro, Hadoop, Hive, ORC, Parquet, Pig, & Spark),
>>> and we expect that list to expand as the community grows and diversifies.
>>> Any Apache project in the big data space that needs to store or process
>>> tabular data would be potentially relevant.
>>> An Excessive Fascination with the Apache Brand
>>> 
>>> We are applying to the Incubator process because we think it is the next
>>> logical step for the Iceberg project after open-sourcing the code. This
>>> proposal is not for the purpose of generating publicity. Rather, we want to
>>> make sure to create a very inclusive and meritocratic community, outside
>>> the umbrella of a single company. Netflix has a long history of
>>> contributing to Apache projects and the Iceberg developers and contributors
>>> understand the implication of making it an Apache project.
>>> Required Resources
>>> Mailing lists
>>> 
>>>   - dev@iceberg.incubator.apache.org
>>>   - commits@iceberg.incubator.apache.org
>>>   - private@iceberg.incubator.apache.org
>>> 
>>> The podling may also create a user mailing list, if needed.
>>> Source Control and Issue Tracking
>>> 
>>> The Iceberg podling would use Apache’s GitBox integration to sync between
>>> GitHub and Apache infrastructure. The podling would use GitHub issues and
>>> pull requests for community engagement.
>>> Current Resources
>>> 
>>>   - Initial source: https://github.com/Netflix/iceberg
>>>   - Java documentation:
>>>   https://netflix.github.io/iceberg/current/javadoc/index.html?com/netflix/iceberg/package-summary.html
>>>   - Table specification:
>>>   https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit
>>> 
>>> Source and Intellectual Property Submission Plan
>>> 
>>> The Iceberg source code in GitHub is currently licensed under Apache
>>> License v2.0 and the copyright is assigned to Netflix. If Iceberg becomes
>>> an Incubator project at the ASF, Netflix will transfer the source code and
>>> trademark ownership to the Apache Software Foundation via a Software Grant
>>> Agreement.
>>> External Dependencies
>>> 
>>> External dependencies licensed under Apache License 2.0
>>> 
>>>   - Guava https://github.com/google/guava
>>>   - Jackson https://github.com/FasterXML/jackson-core
>>>   - Joda-Time http://www.joda.org/joda-time/
>>> 
>>> External dependencies licensed under the MIT License
>>> 
>>>   - SLF4J https://www.slf4j.org/
>>>   - Mockito https://github.com/mockito/mockito
>>> 
>>> ASF Projects
>>> 
>>>   - Apache Avro
>>>   - Apache Hadoop
>>>   - Apache Hive
>>>   - Apache ORC
>>>   - Apache Parquet
>>>   - Apache Pig
>>>   - Apache Spark
>>> 
>>> Cryptography
>>> 
>>> We do not expect Iceberg to be a controlled export item due to the use of
>>> encryption.
>>> Initial Committers and Affiliations
>>> 
>>>   - Ryan Blue blue@apache.org (Netflix)
>>>   - Parth Brahmbhatt parth@apache.org (Netflix)
>>>   - Julien Le Dem julien@apache.org (WeWork)
>>>   - Owen O’Malley omalley@apache.org (Hortonworks)
>>>   - Daniel Weeks dweeks@apache.org (Netflix)
>>> 
>>> Sponsors and Nominated Mentors
>>> 
>>>   - Champion and mentor: Owen O’Malley omalley@apache.org
>>>   - Mentor: Ryan Blue blue@apache.org
>>>   - Mentor: Julien Le Dem julien@apache.org
>>> 
>>> Sponsoring Entity
>>> 
>>> The Apache Incubator
>>> --
>>> Ryan Blue
>>> 
>> 
>> 
>> --
>> Ryan Blue
>> 
> 
> 
> -- 
> Matt Sicker <boards@gmail.com>



