incubator-general mailing list archives

From Jake Farrell <jfarr...@apache.org>
Subject Re: [VOTE] Accept CarbonData into the Apache Incubator
Date Thu, 26 May 2016 13:42:10 GMT
+1 (binding)

-Jake

On Wed, May 25, 2016 at 4:24 PM, Jean-Baptiste Onofré <jb@nanthrax.net>
wrote:

> Hi all,
>
> following the discussion thread, I'm now calling a vote to accept
> CarbonData into the Incubator.
>
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
> This vote is open for 72 hours.
>
> The proposal follows; you can also access it on the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
>
> Thanks !
> Regards
> JB
>
> = Apache CarbonData =
>
> == Abstract ==
>
> Apache CarbonData is a new Apache Hadoop native file format for faster
> interactive queries. It uses advanced columnar storage, indexing,
> compression, and encoding techniques to improve computing efficiency,
> which in turn can speed up queries by an order of magnitude over
> petabytes of data.
>
> CarbonData github address: https://github.com/HuaweiBigData/carbondata
>
> == Background ==
>
> Huawei is an ICT solution provider committed to enhancing customer
> experiences for telecom carriers, enterprises, and consumers on big data.
> To satisfy the following customer requirements, we created a new
> Hadoop native file format:
>
>  * Support interactive OLAP-style queries over big data in seconds.
>  * Support fast queries on individual records, which require touching all
> fields.
>  * Support fast data loading and incremental loads in a period of
> minutes.
>  * Support HDFS so that customers can leverage existing Hadoop clusters.
>  * Support time-based data retention.
>
> Based on these requirements, we investigated existing file formats in the
> Hadoop ecosystem, but we could not find a suitable solution that
> satisfies all of the requirements at the same time, so we started designing
> CarbonData.
>
> == Rationale ==
>
> CarbonData contains multiple modules, which are classified into two
> categories:
>
>  1. CarbonData File Format: contains the core implementation of the file
> format, such as columnar storage, index, dictionary, encoding and
> compression, and the API for reading/writing.
>  2. CarbonData integration with big data processing frameworks such as
> Apache Spark and Apache Hive. Apache Beam is also planned to abstract the
> execution runtime.
>
> === CarbonData File Format ===
>
> The CarbonData file format is a columnar store in HDFS. It has many of the
> features of a modern columnar format, such as splittable files,
> compression schemes, and complex data types. In addition, CarbonData has
> the following unique features:
>
> ==== Indexing ====
>
> To support fast interactive queries, CarbonData leverages indexing
> technology to reduce I/O scans. CarbonData files store data along with
> the index: the index is not stored separately; the CarbonData file itself
> contains it. In the current implementation, CarbonData supports 3 types
> of indexing:
>
> 1. Multi-dimensional Key (B+ Tree index)
>  The data blocks are written in sequence to the disk, and within each data
> block each column block is written in sequence. Finally, the metadata
> block for the file is written with information about the byte positions of
> each block in the file, the Min-Max statistics index, and the start and end
> MDK (multi-dimensional key) of each data block. Since the entire data in
> the file is in sorted order, the start and end MDK of each data block can
> be used to construct a B+Tree, and the file can be logically represented as
> a B+Tree with the data blocks as leaf nodes (on disk) and the remaining
> non-leaf nodes in memory.
> 2. Inverted index
>  Inverted indexes are widely used in search engines. This index helps the
> processing/query engine filter inside one HDFS block. Furthermore,
> combining bitmaps and the inverted index at query time makes it possible
> to accelerate count-distinct-like operations.
> 3. MinMax index
>  A minmax index is created for all columns so that the processing/query
> engine can skip scans that are not required.
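As a rough illustration of the three index types described above, the following Python sketch shows how per-block start/end keys, min-max statistics, and an inverted index can each narrow down what has to be read. It is not CarbonData's implementation or on-disk layout: the `Block` class, the function names, and the sample data are all hypothetical, and a sorted list with binary search stands in for the B+Tree.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical per-block metadata; not CarbonData's real structure."""
    start_key: tuple  # smallest multi-dimensional key (MDK) in the block
    end_key: tuple    # largest MDK in the block
    col_min: dict     # per-column minimum value
    col_max: dict     # per-column maximum value


def block_for_key(blocks, key):
    """Binary-search blocks (sorted by start_key) for the one that may hold key.

    Simplification: assumes a key never spans two blocks.
    Returns the block position, or None if no block can contain the key.
    """
    starts = [b.start_key for b in blocks]
    i = bisect_right(starts, key) - 1
    if i >= 0 and blocks[i].start_key <= key <= blocks[i].end_key:
        return i
    return None


def blocks_for_range(blocks, column, low, high):
    """Min-max pruning: keep only blocks whose [min, max] overlaps [low, high]."""
    return [i for i, b in enumerate(blocks)
            if b.col_max[column] >= low and b.col_min[column] <= high]


def build_inverted_index(column_values):
    """Map each distinct value to the row ids (within one block) that hold it."""
    index = {}
    for row_id, value in enumerate(column_values):
        index.setdefault(value, []).append(row_id)
    return index


# Sample metadata for three blocks sorted by a (country, city) key.
blocks = [
    Block(("cn", "beijing"), ("cn", "shenzhen"), {"sales": 10}, {"sales": 90}),
    Block(("fr", "lyon"), ("fr", "paris"), {"sales": 100}, {"sales": 250}),
    Block(("us", "austin"), ("us", "seattle"), {"sales": 300}, {"sales": 900}),
]

point_hit = block_for_key(blocks, ("fr", "nice"))        # only block 1 is read
point_miss = block_for_key(blocks, ("de", "berlin"))     # no block is read
range_hits = blocks_for_range(blocks, "sales", 200, 400)  # block 0 is skipped
inverted = build_inverted_index(["paris", "lyon", "paris"])
distinct_cities = len(inverted)  # count distinct without scanning the rows again
```

In this sketch a point lookup touches at most one block, a range filter skips every block whose statistics cannot match, and a filter on `"paris"` inside a block resolves directly to row ids 0 and 2.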
>
> ==== Global Dictionary ====
>
> Besides I/O reduction, CarbonData accelerates computation by using a global
> dictionary, which enables processing/query engines to perform all
> processing on encoded data without having to convert the data (Late
> Materialization). We have observed dramatic performance improvements in
> OLAP analytic scenarios where tables contain many string columns. The data
> is converted back to user-readable form just before the processing/query
> engine returns results to the user.
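The idea of processing encoded data and decoding only at the end can be sketched in a few lines of Python. This is a minimal illustration under assumed names (`build_dictionary`, `encode`, and the sample column are hypothetical), not CarbonData's dictionary format: strings are replaced by integer codes, the filter and the group-by run on the integers, and only the final result is translated back.

```python
from collections import Counter


def build_dictionary(values):
    """Assign an integer code to each distinct value, in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values, dictionary):
    """Replace every value with its integer code."""
    return [dictionary[value] for value in values]


cities = ["paris", "lyon", "paris", "nice", "lyon", "paris"]
dictionary = build_dictionary(cities)
encoded = encode(cities, dictionary)

# Filter on encoded data: translate the literal once, then compare integers.
paris_code = dictionary["paris"]
paris_rows = [row_id for row_id, code in enumerate(encoded) if code == paris_code]

# Group-by count on encoded data: no string is touched during aggregation.
counts_by_code = Counter(encoded)

# Late materialization: decode only the (small) final result.
reverse = {code: value for value, code in dictionary.items()}
counts_by_city = {reverse[code]: n for code, n in counts_by_code.items()}
```

Because the dictionary is shared across the whole table rather than per file, the same code means the same value everywhere, which is what lets an engine defer decoding until the last step.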
>
> ==== Column Group ====
>
> Sometimes users want to run processing/queries over many columns of one
> table, for example, scanning individual records in a troubleshooting
> scenario. In this case, a row format is more efficient than a columnar
> format since the workload touches all columns. To accelerate this,
> CarbonData supports storing a group of columns in row format, so that the
> data in a column group is stored together, enabling fast retrieval.
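To make the trade-off concrete, here is a small Python sketch of a mixed layout: some columns are stored one list per column, while the columns in a group are stored together as one tuple per row. The function names, the `__group__` key, and the sample records are hypothetical and only illustrate the idea; they do not reflect how CarbonData lays out column groups on disk.

```python
def build_layout(rows, columnar_columns, column_group):
    """Store columnar_columns column by column and column_group row by row."""
    layout = {name: [row[name] for row in rows] for name in columnar_columns}
    # One tuple per row keeps all grouped fields of a record side by side.
    layout["__group__"] = [tuple(row[name] for name in column_group)
                           for row in rows]
    return layout


def fetch_group_record(layout, column_group, row_id):
    """Read all grouped fields of one record with a single lookup."""
    return dict(zip(column_group, layout["__group__"][row_id]))


rows = [
    {"event_time": 1, "imsi": "460-01", "cell_id": "c7", "status": "ok"},
    {"event_time": 2, "imsi": "460-02", "cell_id": "c9", "status": "drop"},
]
group = ["imsi", "cell_id", "status"]
layout = build_layout(rows, ["event_time"], group)
record = fetch_group_record(layout, group, 1)
```

Fetching a whole record from the group is one lookup, whereas a purely columnar layout would need one lookup per column; scans over `event_time` alone still benefit from the columnar list.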
>
> ==== Optimized for multiple use cases ====
>
> CarbonData indices and dictionaries are highly configurable. To optimize
> storage for different use cases, users can configure what to index, so
> they can decide on and tune the format before loading data into CarbonData.
>
> For example:
>
> || Use Case || Supporting Features ||
> || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
> || High throughput scan || Global dictionary, Minmax index ||
> || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
> || Individual record query || Column group, Global dictionary ||
>
> === BigData Processing Framework Integration ===
>
>  * CarbonData provides InputFormat/OutputFormat interfaces for
> reading/writing data from CarbonData files, and also provides an
> abstract API for processing data stored in the CarbonData format with
> data processing frameworks.
>  * CarbonData provides deep integration with Apache Spark, including
> predicate push down, column pruning, and aggregation push down, so users
> can use Spark SQL to connect to and query CarbonData.
>  * CarbonData can integrate with various big data query/processing
> frameworks in the Hadoop ecosystem, such as Apache Spark and Apache Hive.
>
> Example:
> https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
>
> == Initial Goals ==
>
> Our initial goals are to bring CarbonData into the ASF, transition
> internal engineering processes into the open, and foster a collaborative
> development model according to the "Apache Way".
>
> == Current Status ==
>
> CarbonData is production ready and already provides a large set of features.
> The current license is already Apache 2.0.
>
> == Meritocracy ==
>
> We intend to radically expand the initial developer and user community by
> running the project in accordance with the "Apache Way". Users and new
> contributors will be treated with respect and welcomed. By participating in
> the community and providing quality patches/support that move the project
> forward, they will earn merit. They also will be encouraged to provide
> non-code contributions (documentation, events, community management, etc.)
> and will gain merit for doing so. Those with a proven support and quality
> track record will be encouraged to become committers.
>
> == Community ==
>
> If CarbonData is accepted for incubation, the primary initial goal is to
> build a large community. We firmly believe that CarbonData will become a
> key project for big data column-like platforms, and so we are betting on a
> large community of users and developers.
>
> == Known Risks ==
>
> Development has been sponsored mostly by one company. For the project to
> fully transition to the Apache Way governance model, development must shift
> towards the meritocracy-centric model of growing a community of
> contributors balanced with the needs for extreme stability and core
> implementation coherency.
>
> == Orphaned products ==
>
> Huawei is fully committed to CarbonData. Moreover, Huawei has a vested
> interest in making CarbonData succeed by driving its close integration with
> sister ASF projects. We expect this to further reduce the risk of
> orphaning the product.
>
> == Inexperience with Open Source ==
>
> Huawei has been developing and using open source software for a long
> time. Additionally, several ASF veterans agreed to mentor the project and
> are listed in this proposal. The project will rely on their guidance and
> collective wisdom to quickly transition the entire team of initial
> committers towards practicing the Apache Way.
>
> == Reliance on Salaried Developers ==
>
> Most of the contributors are paid to work in the big data space. While they
> might wander from their current employers, they are unlikely to venture far
> from their core expertise and thus will continue to be engaged with the
> project regardless of their current employers.
>
> == An Excessive Fascination with the Apache Brand ==
>
> While we intend to leverage the Apache ‘branding’ when talking to other
> projects as a testament to our project’s ‘neutrality’, we have no plans to
> use the Apache brand in press releases or to post billboards
> advertising the acceptance of CarbonData into the Apache Incubator.
>
> == Initial Source ==
>
> https://github.com/HuaweiBigData/carbondata.git
>
> == External Dependencies ==
>
> All external dependencies are licensed under the Apache 2.0 license or an
> Apache-compatible license. As we grow the CarbonData community we will
> configure our build process to require and validate that all contributions
> and dependencies are licensed under the Apache 2.0 license or an
> Apache-compatible license.
>
>  * Apache Spark
>  * Apache Hadoop
>  * Apache Maven
>  * Apache Commons
>  * Apache Log4j
>  * Apache Thrift
>  * Apache ZooKeeper
>  * Scala
>  * Snappy
>  * Kettle (Pentaho)
>  * Eigenbase
>  * Fastutil
>  * GSON
>  * JMockit
>  * JUnit
>
> == Required Resources ==
>
> === Mailing lists ===
>
>  * private@carbondata.incubator.apache.org (moderated subscriptions)
>  * commits@carbondata.incubator.apache.org
>  * dev@carbondata.incubator.apache.org
>  * issues@carbondata.incubator.apache.org
>
> === Git Repository ===
>
>  * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
>
> === Issue Tracking ===
>
>  * JIRA Project CarbonData (CarbonData)
>
> === Initial Committers ===
>
>  * Liang Chenliang
>  * Jean-Baptiste Onofré
>  * Henry Saputra
>  * Uma Maheswara Rao G
>  * Jenny MA
>  * Jacky Likun
>  * Vimal Das Kammath
>  * Jarray Qiuheng
>
> === Affiliations ===
>
>  * Huawei: Liang Chenliang
>  * Talend: Jean-Baptiste Onofré
>  * eBay: Henry Saputra
>  * Intel: Uma Maheswara Rao G
>
> === Sponsors ===
>
> === Champion ===
>
>  * Jean-Baptiste Onofré - Apache Member
>
> === Mentors ===
>
>  * Henry Saputra (eBay)
>  * Jean-Baptiste Onofré (Talend)
>  * Uma Maheswara Rao G (Intel)
>
> === Sponsoring Entity ===
>
> The Apache Incubator
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>
