incubator-general mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: [VOTE] Accept the Iceberg project for incubation
Date Tue, 13 Nov 2018 17:19:22 GMT
+1 (binding)

On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher <dave2wave@comcast.net> wrote:

> +1 (binding)
>
> > On Nov 13, 2018, at 9:10 AM, Matt Sicker <boards@gmail.com> wrote:
> >
> > +1 binding
> >
> > On Tue, 13 Nov 2018 at 11:09, Ryan Blue <blue@apache.org> wrote:
> >
> >> +1 (binding)
> >>
> >> On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue <blue@apache.org> wrote:
> >>
> >>> The discuss thread seems to have reached consensus, so I propose
> >> accepting
> >>> the Iceberg project for incubation.
> >>>
> >>> The proposal is copied below and in the wiki:
> >>> https://wiki.apache.org/incubator/IcebergProposal
> >>>
> >>> Please vote on whether to accept Iceberg in the next 72 hours:
> >>>
> >>> [ ] +1, accept Iceberg for incubation
> >>> [ ] -1, reject the Iceberg proposal because . . .
> >>>
> >>> Thank you for reviewing the proposal and voting,
> >>>
> >>> rb
> >>> ------------------------------
> >>> Iceberg Proposal Abstract
> >>>
> >>> Iceberg is a table format for large, slow-moving tabular data.
> >>>
> >>> It is designed to improve on the de-facto standard table layout built
> >> into
> >>> Apache Hive, Presto, and Apache Spark.
> >>> Proposal
> >>>
> >>> The purpose of Iceberg is to provide SQL-like tables that are backed by
> >>> large sets of data files. Iceberg is similar to the Hive table layout,
> >> the
> >>> de-facto standard structure used to track files in a table, but
> provides
> >>> additional guarantees and performance optimizations:
> >>>
> >>>   - Atomicity - Each change to the table will either complete fully or
> >>>   fail. “Do or do not. There is no try.”
> >>>   - Snapshot isolation - Reads use one and only one snapshot of a table
> >>>   at some time without holding a lock.
> >>>   - Safe schema evolution - A table’s schema can change in well-defined
> >>>   ways, without breaking older data files.
> >>>   - Column projection - An engine may request a subset of the available
> >>>   columns, including nested fields.
> >>>   - Predicate pushdown - An engine can push filters into read planning
> >>>   to improve performance using partition data and file-level
> statistics.
> >>>
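
To illustrate the predicate pushdown item above, here is a minimal sketch of how a planner might skip files using partition data and file-level min/max statistics. It is not Iceberg's API: the `DataFile` record, its fields, and the `plan_files` helper are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    """Hypothetical per-file metadata: a partition value plus min/max stats."""
    path: str
    partition_day: str
    min_ts: int
    max_ts: int


def plan_files(files: List[DataFile], day: str, ts_lo: int, ts_hi: int) -> List[DataFile]:
    """Return files that may hold rows with partition == day and ts_lo <= ts <= ts_hi."""
    selected = []
    for f in files:
        # Partition filtering: skip files that belong to other partitions.
        if f.partition_day != day:
            continue
        # Min/max filtering: skip files whose value range cannot overlap the filter.
        if f.max_ts < ts_lo or f.min_ts > ts_hi:
            continue
        selected.append(f)
    return selected


files = [
    DataFile("a.parquet", "2018-11-12", 0, 99),
    DataFile("b.parquet", "2018-11-13", 0, 49),
    DataFile("c.parquet", "2018-11-13", 50, 99),
]
result = plan_files(files, "2018-11-13", 60, 70)
print([f.path for f in result])
```

In this sketch only `c.parquet` survives planning, so a reader never has to open the other two files. Files that pass the check may still contain no matching rows; the statistics only rule files out.
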
> >>> Iceberg does NOT define a new file format. All data is stored in Apache
> >>> Avro, Apache ORC, or Apache Parquet files.
> >>>
> >>> Additionally, Iceberg is designed to work well when data files are
> stored
> >>> in cloud blob stores, even when those systems provide weaker guarantees
> >>> than a file system, including:
> >>>
> >>>   - Eventual consistency in the namespace
> >>>   - High latency for directory listings
> >>>   - No renames of objects
> >>>   - No folder hierarchy
> >>>
> >>> Rationale
> >>>
> >>> Initial benchmarks show dramatic improvements in query planning. For
> >>> example, consider Netflix’s Atlas use case, which stores time-series
> >>> metrics from Netflix runtime systems. One month of data is stored
> >>> across 2.7 million files in 2,688 partitions:
> >>>
> >>>   - Hive table using Parquet:
> >>>      - 400k+ splits, not combined
> >>>      - Explain query: 9.6 minutes wall time (planning only)
> >>>   - Iceberg table with partition filtering:
> >>>      - 15,218 splits, combined
> >>>      - Planning: 10 seconds
> >>>      - Query wall time: 13 minutes
> >>>   - Iceberg table with partition and min/max filtering:
> >>>      - 412 splits
> >>>      - Planning: 25 seconds
> >>>      - Query wall time: 42 seconds
> >>>
> >>> These performance gains combined with the cross-engine compatibility
> are
> >> a
> >>> very compelling story.
> >>> Initial Goals
> >>>
> >>> The initial goal will be to move the existing codebase to Apache and
> >>> integrate with the Apache development process and infrastructure. A
> >> primary
> >>> goal of incubation will be to grow and diversify the Iceberg community.
> >>> We are well aware that the project community is composed largely of
> >>> individuals from a single company. We aim to change that during
> >>> incubation.
> >>> Current Status
> >>>
> >>> As previously mentioned, Iceberg is under active development at
> Netflix,
> >>> and is being used in processing large volumes of data in Amazon EC2.
> >>>
> >>> Iceberg license documentation is already based on Apache guidelines for
> >>> LICENSE and NOTICE content.
> >>> Meritocracy
> >>>
> >>> We value meritocracy and we understand that it is the basis for an open
> >>> community that encourages multiple companies and individuals to
> >> contribute
> >>> and be invested in the project’s future. We will encourage and monitor
> >>> participation and make sure to extend privileges and responsibilities
> to
> >>> all contributors.
> >>> Community
> >>>
> >>> Iceberg is currently being used by developers at Netflix and a growing
> >>> number of users are actively using it in production environments.
> Iceberg
> >>> has received contributions from developers working at Hortonworks,
> >> WeWork,
> >>> and Palantir. By bringing Iceberg to Apache we aim to assure current
> and
> >>> future contributors that the Iceberg community is meritocratic and
> open,
> >> in
> >>> order to broaden and diversify the user and developer community.
> >>> Core Developers
> >>>
> >>> Iceberg was initially developed at Netflix and is under active
> >>> development. We believe Iceberg will be of interest to a broad range of
> >>> users and developers and that incubating the project at the ASF will
> help
> >>> us build a diverse, sustainable community.
> >>> Alignment
> >>>
> >>> Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive, ORC,
> >>> Parquet, Pig, and Spark. We anticipate integration with additional
> Apache
> >>> projects as the Iceberg community and interest in the project grows.
> >>> Known Risks
> >>> Orphaned Products
> >>>
> >>> Netflix is committed to the future development of Iceberg and
> understands
> >>> that graduation to a TLP, while preferable, is not the only positive
> >>> outcome of incubation.
> >>>
> >>> Should the Iceberg project be accepted by the Incubator, the
> prospective
> >>> PPMC would be willing to agree to a target incubation period of 2 years
> >> or
> >>> less, knowing that every Incubator project incurs a certain cost in
> terms
> >>> of ASF infrastructure and volunteer time.
> >>> Inexperience with Open Source
> >>>
> >>> Three of the initial committers are Apache members and Incubator PMC
> >>> members. They will work with the other community members to teach them
> >> the
> >>> Apache Way.
> >>> Homogenous Developers
> >>>
> >>> The majority of the committers work at Netflix, though we are committed
> >> to
> >>> recruiting and developing additional committers from a wide spectrum of
> >>> industries and backgrounds.
> >>> Reliance on Salaried Developers
> >>>
> >>> It is expected that Iceberg development will occur on both salaried
> time
> >>> and on volunteer time, after hours. Most of the initial committers are
> >> paid
> >>> by Netflix to contribute to this project. However, they are all
> >> passionate
> >>> about the project, and we are both confident and hopeful that the
> project
> >>> will continue even if no salaried developers contribute to the project.
> >>> Relationships with Other Apache Products
> >>>
> >>> As mentioned in the Alignment section, Iceberg utilizes a number of
> >>> existing Apache projects (Avro, Hadoop, Hive, ORC, Parquet, Pig, &
> >> Spark),
> >>> and we expect that list to expand as the community grows and
> diversifies.
> >>> Any Apache project in the big data space that needs to store or process
> >>> tabular data would be potentially relevant.
> >>> An Excessive Fascination with the Apache Brand
> >>>
> >>> We are applying to the Incubator process because we think it is the
> next
> >>> logical step for the Iceberg project after open-sourcing the code. This
> >>> proposal is not for the purpose of generating publicity. Rather, we
> want
> >> to
> >>> make sure to create a very inclusive and meritocratic community,
> outside
> >>> the umbrella of a single company. Netflix has a long history of
> >>> contributing to Apache projects and the Iceberg developers and
> >> contributors
> >>> understand the implication of making it an Apache project.
> >>> Required Resources
> >>> Mailing lists
> >>>
> >>>   - dev@iceberg.incubator.apache.org
> >>>   - commits@iceberg.incubator.apache.org
> >>>   - private@iceberg.incubator.apache.org
> >>>
> >>> The podling may also create a user mailing list, if needed.
> >>> Source Control and Issue Tracking
> >>>
> >>> The Iceberg podling would use Apache’s gitbox integration to sync
> between
> >>> github and Apache infrastructure. The podling would use github issues
> and
> >>> pull requests for community engagement.
> >>> Current Resources
> >>>
> >>>   - Initial source: https://github.com/Netflix/iceberg
> >>>   - Java documentation:
> >>>   https://netflix.github.io/iceberg/current/javadoc/index.html?com/netflix/iceberg/package-summary.html
> >>>   - Table specification:
> >>>   https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit
> >>>
> >>> Source and Intellectual Property Submission Plan
> >>>
> >>> The Iceberg source code in Github is currently licensed under Apache
> >>> License v2.0 and the copyright is assigned to Netflix. If Iceberg
> becomes
> >>> an Incubator project at the ASF, Netflix will transfer the source code
> >> and
> >>> trademark ownership to the Apache Software Foundation via a Software
> >> Grant
> >>> Agreement.
> >>> External Dependencies
> >>>
> >>> External dependencies licensed under Apache License 2.0
> >>>
> >>>   - Guava https://github.com/google/guava
> >>>   - Jackson https://github.com/FasterXML/jackson-core
> >>>   - Joda-Time http://www.joda.org/joda-time/
> >>>
> >>> External dependencies licensed under the MIT License
> >>>
> >>>   - SLF4J https://www.slf4j.org/
> >>>   - Mockito https://github.com/mockito/mockito
> >>>
> >>> ASF Projects
> >>>
> >>>   - Apache Avro
> >>>   - Apache Hadoop
> >>>   - Apache Hive
> >>>   - Apache ORC
> >>>   - Apache Parquet
> >>>   - Apache Pig
> >>>   - Apache Spark
> >>>
> >>> Cryptography
> >>>
> >>> We do not expect Iceberg to be a controlled export item due to the use
> of
> >>> encryption.
> >>> Initial Committers and Affiliations
> >>>
> >>>   - Ryan Blue blue@apache.org (Netflix)
> >>>   - Parth Brahmbhatt parth@apache.org (Netflix)
> >>>   - Julien Le Dem julien@apache.org (WeWork)
> >>>   - Owen O’Malley omalley@apache.org (Hortonworks)
> >>>   - Daniel Weeks dweeks@apache.org (Netflix)
> >>>
> >>> Sponsors and Nominated Mentors
> >>>
> >>>   - Champion and mentor: Owen O’Malley omalley@apache.org
> >>>   - Mentor: Ryan Blue blue@apache.org
> >>>   - Mentor: Julien Le Dem julien@apache.org
> >>>
> >>> Sponsoring Entity
> >>>
> >>> The Apache Incubator
> >>> --
> >>> Ryan Blue
> >>>
> >>
> >>
> >> --
> >> Ryan Blue
> >>
> >
> >
> > --
> > Matt Sicker <boards@gmail.com>
>
>
