incubator-general mailing list archives

From Joe Witt <joe.w...@gmail.com>
Subject Re: [VOTE] Accept Kudu into the Apache Incubator
Date Thu, 26 Nov 2015 20:03:54 GMT
+1 (non-binding)

On Wed, Nov 25, 2015 at 5:26 PM, Hitesh Shah <hitesh@apache.org> wrote:
> +1 (binding)
>
> — Hitesh
>
> On Nov 24, 2015, at 11:32 AM, Todd Lipcon <todd@apache.org> wrote:
>
>> Hi all,
>>
>> Discussion on the [DISCUSS] thread seems to have wound down, so I'd like to
>> call a VOTE on acceptance of Kudu into the ASF Incubator. The proposal is
>> pasted below and also available on the wiki at:
>> https://wiki.apache.org/incubator/KuduProposal
>>
>> The proposal is unchanged since the original version, except for the
>> addition of Carl Steinbach as a Mentor.
>>
>> Please cast your votes:
>>
>> [] +1, accept Kudu into the Incubator
>> [] +/-0, positive/negative non-counted expression of feelings
>> [] -1, do not accept Kudu into the incubator (please state reasoning)
>>
>> Given the US holiday this week, I imagine many folks are traveling or
>> otherwise offline. So, let's run the vote for a full week rather than the
>> traditional 72 hours. Unless the IPMC objects to the extended voting
>> period, the vote will close on Tues, Dec 1st at noon PST.
>>
>> Thanks
>> -Todd
>> -----
>>
>> = Kudu Proposal =
>>
>> == Abstract ==
>>
>> Kudu is a distributed columnar storage engine built for the Apache Hadoop
>> ecosystem.
>>
>> == Proposal ==
>>
>> Kudu is an open source storage engine for structured data which supports
>> low-latency random access together with efficient analytical access
>> patterns. Kudu distributes data using horizontal partitioning and
>> replicates each partition using Raft consensus, providing low
>> mean-time-to-recovery and low tail latencies. Kudu is designed within the
>> context of the Apache Hadoop ecosystem and supports many integrations with
>> other data analytics projects both inside and outside of the Apache
>> Software Foundation.
>>
>>
>>
>> We propose to incubate Kudu as a project of the Apache Software Foundation.
>>
>> == Background ==
>>
>> In recent years, explosive growth in the amount of data being generated and
>> captured by enterprises has resulted in the rapid adoption of open source
>> technology which is able to store massive data sets at scale and at low
>> cost. In particular, the Apache Hadoop ecosystem has become a focal point
>> for such “big data” workloads, because many traditional open source
>> database systems have lagged in offering a scalable alternative.
>>
>>
>>
>> Structured storage in the Hadoop ecosystem has typically been achieved in
>> two ways: for static data sets, data is typically stored on Apache HDFS
>> using binary data formats such as Apache Avro or Apache Parquet. However,
>> neither HDFS nor these formats has any provision for updating individual
>> records, or for efficient random access. Mutable data sets are typically
>> stored in semi-structured stores such as Apache HBase or Apache Cassandra.
>> These systems allow for low-latency record-level reads and writes, but lag
>> far behind the static file formats in terms of sequential read throughput
>> for applications such as SQL-based analytics or machine learning.
>>
>>
>>
>> Kudu is a new storage system designed and implemented from the ground up to
>> fill this gap between high-throughput sequential-access storage systems
>> such as HDFS and low-latency random-access systems such as HBase or
>> Cassandra. While these existing systems continue to hold advantages in some
>> situations, Kudu offers a “happy medium” alternative that can dramatically
>> simplify the architecture of many common workloads. In particular, Kudu
>> offers a simple API for row-level inserts, updates, and deletes, while
>> providing table scans at throughputs similar to Parquet, a commonly-used
>> columnar format for static data.
>>
>>
>>
>> More information on Kudu can be found at the existing open source project
>> website: http://getkudu.io and in particular in the Kudu white-paper PDF:
>> http://getkudu.io/kudu.pdf from which the above was excerpted.
>>
>> == Rationale ==
>>
>> As described above, Kudu fills an important gap in the open source storage
>> ecosystem. After our initial open source project release in September 2015,
>> we have seen a great amount of interest across a diverse set of users and
>> companies. We believe that, as a storage system, it is critical to build an
>> equally diverse set of contributors in the development community. Our
>> experiences as committers and PMC members on other Apache projects have
>> taught us the value of diverse communities in ensuring both longevity and
>> high quality for such foundational systems.
>>
>> == Initial Goals ==
>>
>> * Move the existing codebase, website, documentation, and mailing lists to
>> Apache-hosted infrastructure
>> * Work with the infrastructure team to implement and approve our code
>> review, build, and testing workflows in the context of the ASF
>> * Develop and release incrementally per Apache guidelines
>>
>> == Current Status ==
>>
>> ==== Releases ====
>>
>> Kudu has had one public release, tagged here:
>> https://github.com/cloudera/kudu/tree/kudu0.5.0-release
>>
>> This initial release was not performed in the typical ASF fashion -- no
>> source tarball was released; only convenience binaries were made
>> available in Cloudera’s repositories. We will adopt the ASF source release
>> process upon joining the incubator.
>>
>>
>> ==== Source ====
>>
>> Kudu’s source is currently hosted on GitHub at
>> https://github.com/cloudera/kudu
>>
>> This repository will be transitioned to Apache’s git hosting during
>> incubation.
>>
>>
>>
>> ==== Code review ====
>>
>> Kudu’s code reviews are currently public and hosted on Gerrit at
>> http://gerrit.cloudera.org:8080/#/q/status:open+project:kudu
>>
>> The Kudu developer community is very happy with Gerrit and hopes to work
>> with the Apache Infrastructure team to figure out how we can continue to
>> use Gerrit within ASF policies.
>>
>>
>>
>> ==== Issue tracking ====
>>
>> Kudu’s bug and feature tracking is hosted on JIRA at:
>> https://issues.cloudera.org/projects/KUDU/summary
>>
>> This JIRA instance contains bugs and development discussion dating back 2
>> years prior to Kudu’s open source release and will provide an initial seed
>> for the ASF JIRA.
>>
>>
>>
>> ==== Community discussion ====
>>
>> Kudu has several public discussion forums, linked here:
>> http://getkudu.io/community.html
>>
>>
>>
>> ==== Build Infrastructure ====
>>
>> The Kudu Gerrit instance is configured to only allow patches to be
>> committed after running them through an extensive set of pre-commit tests
>> and code lints. The project currently makes use of elastic public cloud
>> resources to perform these tests. Until this point, these resources have
>> been internal to Cloudera, though we are currently investing in moving to a
>> publicly accessible infrastructure.
>>
>>
>>
>> ==== Development practices ====
>>
>> Given that Kudu is a persistent storage engine, the community has a high
>> quality bar for contributions to its core. We have a firm belief that high
>> quality is achieved through automation, not manual inspection, and hence
>> put a focus on thorough testing and build infrastructure to ensure that
>> bar. The development community also practices review-then-commit for all
>> changes to ensure that changes are accompanied by appropriate tests, are
>> well commented, etc.
>>
>> Rather than seeing these practices as barriers to contribution, we believe
>> that a fully automated and standardized review and testing practice makes
>> it easier for new contributors to have patches accepted. Any new developer
>> may post a patch to Gerrit using the same workflow as a seasoned
>> contributor, and the same suite of tests will be automatically run. If the
>> tests pass, a committer can quickly review and commit the contribution from
>> their web browser.
>>
>> === Meritocracy ===
>>
>> We believe strongly in meritocracy in electing committers and PMC members.
>> We believe that contributions can come in forms other than just code: for
>> example, one of our initial proposed committers has contributed solely in
>> the area of project documentation. We will encourage contributions and
>> participation of all types, and ensure that contributors are appropriately
>> recognized.
>>
>> === Community ===
>>
>> Though Kudu is relatively new as an open source project, it has already
>> seen promising growth in its community across several organizations:
>>
>> * '''Cloudera''' is the original development sponsor for Kudu.
>> * '''Xiaomi''' has been helping to develop and optimize Kudu for a new
>> production use case, contributing code, benchmarks, feedback, and
>> conference talks.
>> * '''Intel''' has contributed optimizations related to their hardware
>> technologies.
>> * '''Dropbox''' has been experimenting with Kudu for a machine monitoring
>> use case, and has been contributing bug reports and product feedback.
>> * '''Dremio''' is working on integration with Apache Drill and exploring
>> using Kudu in a production use case.
>> * Several community-built Docker images, tutorials, and blog posts have
>> sprouted up since Kudu’s release.
>>
>>
>>
>> By bringing Kudu to Apache, we hope to encourage further contribution from
>> the above organizations as well as to engage new users and contributors in
>> the community.
>>
>> === Core Developers ===
>>
>> Kudu was initially developed as a project at Cloudera. Most of the
>> contributions to date have been by developers employed by Cloudera.
>>
>>
>>
>> Many of the developers are committers or PMC members on other Apache
>> projects.
>>
>> === Alignment ===
>>
>> As a project in the big data ecosystem, Kudu is aligned with several other
>> ASF projects. Kudu includes input/output format integration with Apache
>> Hadoop, and this integration can also provide a bridge to Apache Spark. We
>> are planning to integrate with Apache Hive in the near future. We also
>> integrate closely with Cloudera Impala, which is also currently being
>> proposed for incubation. We have also scheduled a hackathon with the Apache
>> Drill team to work on integration with that query engine.
>>
>> == Known Risks ==
>>
>> === Orphaned Products ===
>>
>> The risk of Kudu being abandoned is low. Cloudera has invested a great deal
>> in the initial development of the project, and intends to grow its
>> investment over time as Kudu becomes a product adopted by its customer
>> base. Several other organizations are also experimenting with Kudu for
>> production use cases which would live for many years.
>>
>> === Inexperience with Open Source ===
>>
>> Kudu has been released in the open for less than two months. However, from
>> our very first public announcement we have been committed to open-source
>> style development:
>>
>> * our code reviews are fully public and documented on a mailing list
>> * our daily development chatter is in a public chat room
>> * we send out weekly “community status” reports highlighting news and
>> contributions
>> * we published our entire JIRA history and discuss bugs in the open
>> * we published our entire Git commit history, going back three years (no
>> squashing)
>>
>>
>>
>> Several of the initial committers are experienced open source developers,
>> several being committers and/or PMC members on other ASF projects (Hadoop,
>> HBase, Thrift, Flume, et al). Those who are not ASF committers have
>> experience on non-ASF open source projects (Kiji, open-vm-tools, et al).
>>
>> === Homogenous Developers ===
>>
>> The initial committers are employees or former employees of Cloudera.
>> However, the committers are spread across multiple offices (Palo Alto, San
>> Francisco, Melbourne), so the team is familiar with working in a
>> distributed environment across varied time zones.
>>
>>
>>
>> The project has received some contributions from developers outside of
>> Cloudera, and is starting to attract a ''user'' community as well. We hope
>> to continue to encourage contributions from these developers and community
>> members and grow them into committers after they have had time to continue
>> their contributions.
>>
>> === Reliance on Salaried Developers ===
>>
>> As mentioned above, the majority of development up to this point has been
>> sponsored by Cloudera. We have seen several community users participate in
>> discussions who are hobbyists interested in distributed systems and
>> databases, and hope that they will continue their participation in the
>> project going forward.
>>
>> === Relationships with Other Apache Products ===
>>
>> Kudu is currently related to the following other Apache projects:
>>
>> * Hadoop: Kudu provides MapReduce input/output formats for integration
>> * Spark: Kudu integrates with Spark via the above-mentioned input formats,
>> and work is progressing on support for Spark Data Frames and Spark SQL.
>>
>>
>>
>> The Kudu team has reached out to several other Apache projects to start
>> discussing integrations, including Flume, Kafka, Hive, and Drill.
>>
>>
>>
>> Kudu integrates with Impala, which is also being proposed for incubation.
>>
>>
>>
>> Kudu is already collaborating on ValueVector, a proposed TLP spinning out
>> from the Apache Drill community.
>>
>>
>>
>> We look forward to continuing to integrate and collaborate with these
>> communities.
>>
>> === An Excessive Fascination with the Apache Brand ===
>>
>> Many of the initial committers are already experienced Apache committers,
>> and understand the true value provided by the Apache Way and the principles
>> of the ASF. We believe that this development and contribution model is
>> especially appropriate for storage products, where Apache’s
>> community-over-code philosophy ensures long term viability and
>> consensus-based participation.
>>
>> == Documentation ==
>>
>> * Documentation is written in AsciiDoc and committed in the Kudu source
>> repository:
>>
>> * https://github.com/cloudera/kudu/tree/master/docs
>>
>>
>>
>> * The Kudu web site is version-controlled on the ‘gh-pages’ branch of the
>> above repository.
>>
>> * A LaTeX whitepaper is also published, and the source is available within
>> the same repository.
>> * APIs are documented within the source code as JavaDoc or C++-style
>> documentation comments.
>> * Many design documents are stored within the source code repository as
>> text files next to the code being documented.
>>
>> == Source and Intellectual Property Submission Plan ==
>>
>> The Kudu codebase and web site are currently hosted on GitHub and will be
>> transitioned to the ASF repositories during incubation. Kudu is already
>> licensed under the Apache 2.0 license.
>>
>>
>>
>> Some portions of the code are imported from other open source projects
>> under the Apache 2.0, BSD, or MIT licenses, with copyrights held by authors
>> other than the initial committers. These copyright notices are maintained
>> in those files as well as in a top-level NOTICE.txt file. We believe this
>> to be permissible under the license terms and ASF policies, and confirmed
>> this in a recent thread on general@incubator.apache.org.
>>
>>
>>
>> The “Kudu” name is not a registered trademark, though before the initial
>> release of the project, we performed a trademark search and Cloudera’s
>> legal counsel deemed it acceptable in the context of a data storage engine.
>> There exists an unrelated open source project by the same name related to
>> deployments on Microsoft’s Azure cloud service. We have been in contact
>> with legal counsel from Microsoft and have obtained their approval for the
>> use of the Kudu name.
>>
>>
>>
>> Cloudera currently owns several domain names related to Kudu (getkudu.io,
>> kududb.io, et al) which will be transferred to the ASF and redirected to
>> the official page during incubation.
>>
>>
>>
>> Portions of Kudu are protected by pending or published patents owned by
>> Cloudera. Given the protections already granted by the Apache License, we
>> do not anticipate any explicit licensing or transfer of this intellectual
>> property.
>>
>> == External Dependencies ==
>>
>> The full set of dependencies and licenses is listed in
>> https://github.com/cloudera/kudu/blob/master/LICENSE.txt
>>
>> and summarized here:
>>
>> * '''Twitter Bootstrap''': Apache 2.0
>> * '''d3''': BSD 3-clause
>> * '''epoch JS library''': MIT
>> * '''lz4''': BSD 2-clause
>> * '''gflags''': BSD 3-clause
>> * '''glog''': BSD 3-clause
>> * '''gperftools''': BSD 3-clause
>> * '''libev''': BSD 2-clause
>> * '''squeasel''': MIT
>> * '''protobuf''': BSD 3-clause
>> * '''rapidjson''': MIT
>> * '''snappy''': BSD 3-clause
>> * '''trace-viewer''': BSD 3-clause
>> * '''zlib''': zlib license
>> * '''llvm''': University of Illinois/NCSA Open Source (BSD-alike)
>> * '''bitshuffle''': MIT
>> * '''boost''': Boost license
>> * '''curl''': MIT
>> * '''libunwind''': MIT
>> * '''nvml''': BSD 3-clause
>> * '''cyrus-sasl''': Cyrus SASL license (BSD-alike)
>> * '''openssl''': OpenSSL License (BSD-alike)
>>
>> * '''Guava''': Apache 2.0
>> * '''StumbleUpon Async''': BSD
>> * '''Apache Hadoop''': Apache 2.0
>> * '''Apache log4j''': Apache 2.0
>> * '''Netty''': Apache 2.0
>> * '''slf4j''': MIT
>> * '''Apache Commons''': Apache 2.0
>> * '''murmur''': Apache 2.0
>>
>>
>> '''Build/test-only dependencies''':
>>
>> * '''CMake''': BSD 3-clause
>> * '''gcovr''': BSD 3-clause
>> * '''gmock''': BSD 3-clause
>> * '''Apache Maven''': Apache 2.0
>> * '''JUnit''': EPL
>> * '''Mockito''': MIT
>>
>> == Cryptography ==
>>
>> Kudu does not currently include any cryptography-related code.
>>
>> == Required Resources ==
>>
>> === Mailing lists ===
>>
>> * private@kudu.incubator.apache.org (PMC)
>> * commits@kudu.incubator.apache.org (git push emails)
>> * issues@kudu.incubator.apache.org (JIRA issue feed)
>> * dev@kudu.incubator.apache.org (Gerrit code reviews plus dev discussion)
>> * user@kudu.incubator.apache.org (User questions)
>>
>>
>> === Repository ===
>>
>> * git://git.apache.org/kudu
>>
>> === Gerrit ===
>>
>> We hope to continue using Gerrit for our code review and commit workflow.
>> The Kudu team has already been in contact with Jake Farrell to start
>> discussions on how Gerrit can fit into the ASF. We know that several other
>> ASF projects and podlings are also interested in Gerrit.
>>
>>
>>
>> If the Infrastructure team does not have the bandwidth to support Gerrit,
>> we will continue to support our own instance of Gerrit for Kudu, and make
>> the necessary integrations such that commits are properly authenticated and
>> maintain sufficient provenance to uphold the ASF standards (e.g. via the
>> solution adopted by the AsterixDB podling).
>>
>> == Issue Tracking ==
>>
>> We would like to import our current JIRA project into the ASF JIRA, such
>> that our historical commit messages and code comments continue to reference
>> the appropriate bug numbers.
>>
>> == Initial Committers ==
>>
>> * Adar Dembo adar@cloudera.com
>> * Alex Feinberg alex@strlen.net
>> * Andrew Wang wang@apache.org
>> * Dan Burkert dan@cloudera.com
>> * David Alves dralves@apache.org
>> * Jean-Daniel Cryans jdcryans@apache.org
>> * Mike Percy mpercy@apache.org
>> * Misty Stanley-Jones misty@apache.org
>> * Todd Lipcon todd@apache.org
>>
>> The initial list of committers was seeded by listing those contributors who
>> have contributed 20 or more patches in the last 12 months, indicating that
>> they are active and have achieved merit through participation on the
>> project. We chose not to include other contributors who either have not yet
>> contributed a significant number of patches, or whose contributions are far
>> in the past and who we don’t expect to be active within the ASF.
>>
>> == Affiliations ==
>>
>> * Adar Dembo - Cloudera
>> * Alex Feinberg - Forward Networks
>> * Andrew Wang - Cloudera
>> * Dan Burkert - Cloudera
>> * David Alves - Cloudera
>> * Jean-Daniel Cryans - Cloudera
>> * Mike Percy - Cloudera
>> * Misty Stanley-Jones - Cloudera
>> * Todd Lipcon - Cloudera
>>
>> == Sponsors ==
>>
>> === Champion ===
>>
>> * Todd Lipcon
>>
>> === Nominated Mentors ===
>>
>> * Jake Farrell - ASF Member and Infra team member, Acquia
>> * Brock Noland - ASF Member, StreamSets
>> * Michael Stack - ASF Member, Cloudera
>> * Jarek Jarcec Cecho - ASF Member, Cloudera
>> * Chris Mattmann - ASF Member, NASA JPL and USC
>> * Julien Le Dem - Incubator PMC, Dremio
>> * Carl Steinbach - ASF Member, LinkedIn
>>
>> === Sponsoring Entity ===
>>
>> The Apache Incubator
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>


