incubator-cvs mailing list archives

From Apache Wiki <>
Subject [Incubator Wiki] Update of "KylinProposal" by lukehan
Date Fri, 14 Nov 2014 15:18:13 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "KylinProposal" page has been changed by lukehan:

  ## page was copied from DrillProposal
- = Drill =
+ = Kylin =
  == Abstract ==
- Drill is a distributed system for interactive analysis of large-scale datasets, inspired
by [[|Google's Dremel]].
+ Kylin is a distributed and scalable OLAP engine built on Hadoop to support extremely large
  == Proposal ==
- Drill is a distributed system for interactive analysis of large-scale datasets. Drill is
similar to Google's Dremel, with the additional flexibility needed to support a broader range
of query languages, data formats and data sources. It is designed to efficiently process nested
data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes
of data and trillions of records in seconds.
+ Kylin is an open source Distributed Analytics Engine that provides a SQL interface and
multi-dimensional analysis (MOLAP) on Hadoop.  Kylin is designed to accelerate analytics on
Hadoop by allowing the use of SQL-compatible tools, to support extremely large datasets, and
to integrate tightly with the Hadoop ecosystem.
+ === Overview of Kylin ===
+ The Kylin platform has two parts: data processing and interactive querying. First, Kylin reads
data from its source, Hive, and runs a set of tasks, including MapReduce jobs and shell scripts,
to pre-calculate results for a specified data model, then saves the resulting OLAP cube into
storage such as HBase. Once these OLAP cubes are ready, a user can submit a request from any
SQL-based tool or third-party application to Kylin’s REST server. The server calls the Query
Engine to determine if the target dataset already exists. If so, the engine directly accesses the
target data in the form of a predefined cube, and returns the result with sub-second latency.
Otherwise, the engine is designed to route non-matching queries to whichever SQL-on-Hadoop tool
is already available on the Hadoop cluster, such as Hive.
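
As a rough illustration of the routing decision described above, the following sketch checks
whether a query can be answered from a pre-built cube. It is not Kylin's actual Query Engine
logic: the function `route_query`, the dimension and measure names, and the rule that a query
matches when its grouping columns and measures are all covered by the cube are assumptions
made for the example.

```python
def route_query(group_by, measures, cube_dimensions, cube_measures):
    """Decide where a query should run (hypothetical rule, for illustration only).

    A query is served from the cube when every grouping column is a cube
    dimension and every requested measure was pre-aggregated; otherwise it is
    handed to a fallback SQL-on-Hadoop tool such as Hive.
    """
    covered = (set(group_by) <= set(cube_dimensions)
               and set(measures) <= set(cube_measures))
    return "cube" if covered else "fallback"


# Hypothetical cube definition.
CUBE_DIMENSIONS = ["site", "category", "day"]
CUBE_MEASURES = ["price_sum", "order_count"]

# Grouping by site and day is covered by the cube.
print(route_query(["site", "day"], ["price_sum"], CUBE_DIMENSIONS, CUBE_MEASURES))
# seller_id is not a cube dimension, so the query is routed to the fallback.
print(route_query(["seller_id"], ["price_sum"], CUBE_DIMENSIONS, CUBE_MEASURES))
```

The first call prints `cube` and the second prints `fallback`.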
+ The Kylin platform includes:
+ '''Metadata Manager:'''  Kylin is a metadata-driven application. The Kylin Metadata Manager
is the key component that manages all metadata stored in Kylin including all cube metadata.
All other components rely on the Metadata Manager.
+ '''Job Engine:'''  This engine is designed to handle all of the offline jobs including shell
script, Java API, and Map Reduce jobs. The Job Engine manages and coordinates all of the jobs
in Kylin to make sure each job executes and handles failures.
+ '''Storage Engine:'''  This engine manages the underlying storage – specifically, the
cuboids, which are stored as key-value pairs. The Storage Engine uses HBase – the best solution
from the Hadoop ecosystem for leveraging an existing K-V system. Kylin can also be extended
to support other K-V systems, such as Redis.
+ '''Query Engine:'''  Once the cube is ready, the Query Engine can receive and parse user
queries. It then interacts with other components to return the results to the user.
+ '''REST Server:'''  The REST Server is an entry point for applications to develop against
Kylin.  Applications can submit queries, get results, trigger cube build jobs, get metadata,
get user privileges, and so on.
+ '''ODBC Driver:'''  To support third-party tools and applications – such as Tableau –
we have built and open-sourced an ODBC Driver.  The goal is to make it easy for users to onboard.
  == Background ==
- Many organizations have the need to run data-intensive applications, including batch processing,
stream processing and interactive analysis. In recent years open source systems have emerged
to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm,
Apache S4). In 2010 Google published a paper called "Dremel: Interactive Analysis of Web-Scale
Datasets," describing a scalable system used internally for interactive analysis of nested
data. No open source project has successfully replicated the capabilities of Dremel.
+ The challenge we face at eBay is that our data volume is becoming bigger and bigger while
our user base is becoming more diverse.  For example, our business users and analysts consistently
ask for minimal latency when visualizing data on Tableau and Excel.
+ So, we worked closely with our internal analyst community and outlined the product requirements
for Kylin:
+  1. Sub-second query latency on billions of rows
+  2. ANSI SQL availability for those using SQL-compatible tools
+  3. Full OLAP capability to offer advanced functionality
+  4. Support for high cardinality and very large dimensions
+  5. High concurrency for thousands of users
+  6. Distributed and scale-out architecture for analysis in the TB to PB size range
+ Existing SQL-on-Hadoop solutions commonly need to perform partial or full table or file
scans to compute the results of queries. The cost of these large data scans can make many
queries very slow (more than a minute).  The core idea of MOLAP (multi-dimensional OLAP) is
to pre-compute data along dimensions of interest and store resulting aggregates as a "cube".
MOLAP is much faster but is inflexible.   
+ We realized that no existing external product met our exact requirements, especially
in the open source Hadoop community.  To meet our emerging business needs, we built a platform
from scratch to support MOLAP for these business requirements and, later, to support other
approaches, including ROLAP. With an excellent development team and several pilot customers,
we have been able to bring the Kylin platform into production as well as open source it.
  == Rationale ==
- There is a strong need in the market for low-latency interactive analysis of large-scale
datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified
by Google and addressed internally with a system called Dremel.
+ When data grows to petabyte scale, pre-calculating query results takes a long time and
requires costly, powerful hardware. However, with the benefit of Hadoop’s distributed
computing architecture, jobs can leverage hundreds or thousands of Hadoop data nodes. There
still exists a big gap between the growing volume of data and interactive analytics:
- In recent years open source systems have emerged to address the need for scalable batch
processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally
inspired by Google's internal MapReduce system, is used by thousands of organizations processing
large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not
designed to achieve the sub-second latency needed for interactive data analysis and exploration.
Drill, inspired by Google's internal Dremel system, is intended to address this need. 
+  1. Existing Business Intelligence (OLAP) platforms cannot scale out to support fast growing
+  2. Existing SQL-on-Hadoop projects are not designed for OLAP use cases; joins across huge
tables always take a long time to scan and calculate.
+  3. No mature OLAP solution exists on Hadoop
- It is worth noting that, as explained by Google in the original paper, Dremel complements
MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often
used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype
larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees.
+ As mentioned in the Background section, the business requirements triggered by the increase
in data volume drove eBay to invest in building a solution from scratch to offer analytics
capability on Hadoop clusters. With Hadoop’s power of distributed computing, Kylin can perform
pre-calculations in parallel and merge the final results, thereby significantly reducing the
processing time.
+ To serve queries from the analyst community, Kylin generates cuboids for all possible
combinations of dimensions, and calculates all metrics at different levels. The cuboids are then
integrated to form a pre-calculated OLAP cube. All cuboids are key-value structured: keys are
composites formed from combinations of multiple dimensions, and values are the aggregation
results for that particular combination of dimensions. Kylin uses HBase to store cubes.  HBase
is useful because it supports efficient searches across ranges of data.
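
The cuboid idea above can be sketched in a few lines. This is a minimal in-memory
illustration, not Kylin's MapReduce-based build or its HBase key encoding: the function
`build_cube`, the sample rows, and the choice of a simple sum as the only metric are
assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate `measure` for every combination of `dimensions`.

    Returns a dict mapping each dimension combination (one cuboid) to a
    key-value table: keys are tuples of dimension values, values are sums.
    """
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)  # composite key
                cuboid[key] += row[measure]        # aggregated value
            cube[dims] = dict(cuboid)
    return cube


# Hypothetical fact rows with two dimensions and one measure.
rows = [
    {"site": "US", "category": "toys", "price": 10.0},
    {"site": "US", "category": "books", "price": 5.0},
    {"site": "DE", "category": "toys", "price": 7.0},
]
cube = build_cube(rows, ["site", "category"], "price")

# A query grouped by site becomes a lookup in the ("site",) cuboid.
print(cube[("site",)])
```

With two dimensions the cube holds four cuboids (including the empty combination, which is
the grand total). The number of cuboids doubles with each added dimension, which is why the
pre-calculation benefits from running in parallel on the cluster.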
- Like Dremel, Drill supports a nested data model with data encoded in a number of formats
such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard,
so supporting a nested data model eliminates the need to normalize the data. With that said,
flat data formats, such as CSV files, are naturally supported as a special case of nested
- The Drill architecture consists of four key components/layers:
-  * Query languages: This layer is responsible for parsing the user's query and constructing
an execution plan.  The initial goal is to support the SQL-like language used by Dremel and
[[|Google BigQuery]], which we
call DrQL. However, Drill is designed to support other languages and programming models, such
as the [[|Mongo Query Language]],
[[|Cascading]] or [[|Plume]].
-  * Low-latency distributed execution engine: This layer is responsible for executing the
physical plan. It provides the scalability and fault tolerance needed to efficiently query
petabytes of data on 10,000 servers. Drill's execution engine is based on research in distributed
execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and
can be extended with additional operators and connectors.
-  * Nested data formats: This layer is responsible for supporting various data formats. The
initial goal is to support the column-based format used by Dremel. Drill is designed to support
schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less
formats such as JSON, BSON or YAML. In addition, it is designed to support column-based formats
such as Dremel, AVRO-806/Trevni and RCFile, and row-based formats such as Protocol Buffers,
Avro, JSON, BSON and CSV. A particular distinction with Drill is that the execution engine
is flexible enough to support column-based processing as well as row-based processing. This
is important because column-based processing can be much more efficient when the data is stored
in a column-based format, but many large data assets are stored in a row-based format that
would require conversion before use.
-  * Scalable data sources: This layer is responsible for supporting various data sources.
The initial focus is to leverage Hadoop as a data source.
- It is worth noting that no open source project has successfully replicated the capabilities
of Dremel, nor have any taken on the broader goals of flexibility (eg, pluggable query languages,
data formats, data sources and execution engine operators/connectors) that are part of Drill.
- == Initial Goals ==
- The initial goals for this project are to specify the detailed requirements and architecture,
and then develop the initial implementation including the execution engine and DrQL. 
- Like Apache Hadoop, which was built to support multiple storage systems (through the FileSystem
API) and file formats (through the InputFormat/OutputFormat APIs), Drill will be built to
support multiple query languages, data formats and data sources. The initial implementation
of Drill will support the DrQL and a column-based format similar to Dremel. 
  == Current Status ==
- Significant work has been completed to identify the initial requirements and define the
overall system architecture. The next step is to implement the four components described in
the Rationale section, and we intend to do that development as an Apache project.
  === Meritocracy ===
- We plan to invest in supporting a meritocracy. We will discuss the requirements in an open
forum. Several companies have already expressed interest in this project, and we intend to
invite additional developers to participate. We will encourage and monitor community participation
so that privileges can be extended to those that contribute. Also, Drill has an extensible/pluggable
architecture that encourages developers to contribute various extensions, such as query languages,
data formats, data sources and execution engine operators and connectors. While some companies
will surely develop commercial extensions, we also anticipate that some companies and individuals
will want to contribute such extensions back to the project, and we look forward to fostering
a rich ecosystem of extensions.
+ Kylin has been deployed in production at eBay and is processing extremely large datasets.
The platform has demonstrated great performance benefits and has proved to be a better way
for analysts to leverage data on Hadoop with a more convenient approach using their favorite
  === Community ===
- The need for a system for interactive analysis of large datasets in the open source is tremendous,
so there is a potential for a very large community. We believe that Drill's extensible architecture
will further encourage community participation. Also, related Apache projects (eg, Hadoop)
have very large and active communities, and we expect that over time Drill will also attract
a large community.
+ Kylin seeks to develop developer and user communities during incubation.
  === Core Developers ===
+ Kylin is currently being designed and developed by six engineers from eBay Inc. – Jiang
Xu, Luke Han, Yang Li, George Song, Hongbin Ma and Xiaodong Duo. In addition, some outside
contributors are actively contributing in design and development. Among them, Julian Hyde
from Hortonworks is a very important contributor. All of these core developers have deep expertise
in Hadoop and the Hadoop Ecosystem in general.
- The developers on the initial committers list include experienced distributed systems engineers:
-  * Tomer Shiran has experience developing distributed execution engines. He developed Parallel
DataSeries, a data-parallel version of the open source [[|DataSeries]]
system. He is also the author of Applying Idealized Lower-bound Runtime Models to Understand
Inefficiencies in Data-intensive Computing (SIGMETRICS 2011). Tomer worked as a software developer
and researcher at IBM Research, Microsoft and HP Labs, and is now at MapR Technologies. He
has been active in the Hadoop community since 2009.
-  * Jason Frantz was at Clustrix, where he designed and developed the first scale-out SQL
database based on MySQL. Jason developed the distributed query optimizer that powered Clustrix.
He is now a software engineer and architect at MapR Technologies.
-  * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout, and has a history
of over 30 years of contributions to open source. He is now at MapR Technologies. Ted has
been very active in the Hadoop community since the project's early days.
-  * MC Srivas is the co-founder and CTO of MapR Technologies. While at Google he worked on
Google's scalable search infrastructure. MC Srivas has been active in the Hadoop community
since 2009.
-  * Chris Wensel is the founder and CEO of Concurrent. Prior to founding Concurrent, he developed
Cascading, an Apache-licensed open source application framework enabling Java developers to
quickly and easily develop robust Data Analytics and Data Management applications on Apache
Hadoop. Chris has been involved in the Hadoop community since the project's early days.
-  * Keys Botzum was at IBM, where he worked on security and distributed systems, and is currently
at MapR Technologies. 
-  * Gera Shegalov was at Oracle, where he worked on networking, storage and database kernels,
and is currently at MapR Technologies.
-  * Ryan Rawson is the VP Engineering of Drawn to Scale where he developed Spire, a real-time
operational database for Hadoop. He is also a committer and PMC member for Apache HBase, and
has a long history of contributions to open source. Ryan has been involved in the Hadoop community
since the project's early days.
- We realize that additional employer diversity is needed, and we will work aggressively to
recruit developers from additional companies.
  === Alignment ===
- The initial committers strongly believe that a system for interactive analysis of large-scale
datasets will gain broader adoption as an open source, community driven project, where the
community can contribute not only to the core components, but also to a growing collection
of query languages and optimizers, data formats, data formats, and execution engine operators
and connectors. Drill will integrate closely with Apache Hadoop. First, the data will live
in Hadoop. That is, Drill will support Hadoop FileSystem implementations and HBase. Second,
Hadoop-related data formats will be supported (eg, Apache Avro, RCFile). Third, MapReduce-based
tools will be provided to produce column-based formats. Fourth, Drill tables can be registered
in HCatalog. Finally, Hive is being considered as the basis of the DrQL implementation.
+ The ASF is a natural host for Kylin given that it is already the home of Hadoop, Pig, Hive,
and other emerging cloud software projects. Kylin was designed to offer OLAP capability on
Hadoop from the beginning in order to solve data access and analysis challenges in Hadoop
clusters. Kylin complements the existing Hadoop analytics area by providing a comprehensive
solution based on pre-computed views.
+ In Kylin, we are leveraging an open-source dynamic data management framework called Apache
Calcite to parse SQL and plug in our code. Apache Calcite was previously called Optiq, was
originally authored by Julian Hyde and is now an Apache Incubator project.
  == Known Risks ==
  === Orphaned Products ===
- The contributors are leading vendors in this space, with significant open source experience,
so the risk of being orphaned is relatively low. The project could be at risk if vendors decided
to change their strategies in the market. In such an event, the current committers plan to
continue working on the project on their own time, though the progress will likely be slower.
We plan to mitigate this risk by recruiting additional committers.
+ The core developers of the Kylin team plan to work full time on this project. There is very
little risk of Kylin getting orphaned since at least one large company (eBay) is extensively
using it in its production Hadoop clusters. For example, there are currently 3 use cases
with more than 12 billion rows and 1000 activity requests per day using Kylin in production.
Furthermore, since Kylin was open sourced at the beginning of October 2014, it has received
more than 280 stars and been forked nearly 100 times. Kylin has had one major release so far
and received 5 pull requests from external contributors in its first month, which further
demonstrates that Kylin is a very active project. We plan
to extend and diversify this community further through Apache.
  === Inexperience with Open Source ===
- The initial committers include veteran Apache members (committers and PMC members) and other
developers who have varying degrees of experience with open source projects. All have been
involved with source code that has been released under an open source license, and several
also have experience developing code with an open source development process.
+ The core developers are all active users and followers of open source. They are already
committers and contributors to the Kylin GitHub project. All have been involved with the source
code that has been released under an open source license, and several of them also have experience
developing code in an open source environment. Though the core developers do not have
Apache open source experience, there are plans to onboard individuals with Apache open source
experience on to the project.
  === Homogenous Developers ===
- The initial committers are employed by a number of companies, including MapR Technologies,
Concurrent and Drawn to Scale. We are committed to recruiting additional committers from other
+ The core developers include developers from eBay, Ctrip and Hortonworks. The Apache Incubation
process encourages an open and diverse meritocratic community. Apache Kylin has the required
amount of diversity with committers from three different organizations, but is also aware
that the bulk of the commits come from a single entity. Kylin intends to make every possible
effort to build a diverse, vibrant and involved community and has already received substantial
interest from various organizations.
  === Reliance on Salaried Developers ===
- It is expected that Drill development will occur on both salaried time and on volunteer
time, after hours. The majority of initial committers are paid by their employer to contribute
to this project. However, they are all passionate about the project, and we are confident
that the project will continue even if no salaried developers contribute to the project. We
are committed to recruiting additional committers including non-salaried developers.
+ eBay invested in Kylin as the OLAP solution on top of Hadoop clusters and some of its key
engineers are working full time on the project. In addition, since there is a growing Big
Data need for scalable OLAP solutions on Hadoop, we look forward to other Apache developers
and researchers contributing to the project. Additional contributors, including Apache committers,
have plans to join this effort shortly. Key to addressing the risk associated with relying
on salaried developers from a single entity is to increase the diversity of the contributors
and actively lobby for domain experts in the BI space to contribute. Apache Kylin intends
to do this.  One step already taken is to approach the Apache Drill project to explore
possible cooperation.
  === Relationships with Other Apache Products ===
- As mentioned in the Alignment section, Drill is closely integrated with Hadoop, Avro, Hive
and HBase in numerous ways. For example, Drill data lives inside a Hadoop environment (Drill
operates on in situ data). We look forward to collaborating with those communities, as well
as other Apache communities. 
+ Kylin has a strong relationship with, and dependency on, Apache Hadoop, HBase, Hive and Calcite.
Being part of Apache’s Incubation community could help with closer collaboration among
these four projects as well as others.
+ Kylin is likely to have substantial value to Apache Drill due to the common use of Calcite
as a query optimization engine and similar approaches between Kylin's approach to cubing and
Drill's approach to input sources.
  === An Excessive Fascination with the Apache Brand ===
- Drill solves a real problem that many organizations struggle with, and has been proven within
Google to be of significant value. The architecture is based on academic and industry research.
Our rationale for developing Drill as an Apache project is detailed in the Rationale section.
We believe that the Apache brand and community process will help us attract more contributors
to this project, and help establish ubiquitous APIs. In addition, establishing consensus among
users and developers of a Dremel-like tool is a key requirement for success of the project.
+ Kylin is proposing to enter incubation at Apache in order to help efforts to diversify the
committer base, not so much to capitalize on the Apache brand. The Kylin project is in production
use already inside eBay, but is not expected to be an eBay product for external customers.
As such, the Kylin project is not seeking to use the Apache brand as a marketing tool.
  == Documentation ==
- Drill is inspired by Google's Dremel. Google has published a [[|paper]]
highlighting Dremel's innovative nested column-based data format and execution engine.
+ Information about Kylin can be found at The following
links provide more information about Kylin in open source:
+  * Kylin web site: [[|]]
+  * Codebase at Github: [[|]]
+  * Issue Tracking: [[|]]
+  * User community:  [[!forum/kylin-olap|!forum/kylin-olap]]
  == Initial Source ==
- The requirement and design documents are currently stored in MapR Technologies' source code
repository. They will be checked in as part of the initial code dump. Check out the [[attachment:Drill
slides.pdf|attached slides]].
+ Kylin has been under development since 2013 by a team of engineers at eBay Inc. It is currently
hosted on under an Apache license at [[|]]
+ == External Dependencies ==
+ Kylin has the following external dependencies.
+ === Basic ===
+  * JDK 1.6+
+  * Apache Maven
+  * JUnit
+  * DBUnit
+  * Log4j
+  * Slf4j
+  * Apache Commons
+  * Google Guava
+  * Jackson
+ === Hadoop ===
+  * Apache Hadoop
+  * Apache HBase
+  * Apache Hive
+  * Apache Zookeeper
+  * Apache Curator
+ === Utility ===
+  * H2
+  * JSCH
+ === REST Service ===
+  * Spring
+ === Query ===
+  * Antlr
+  * Apache Calcite (formerly Optiq)
+  * Linq4j
+ === Job ===
+  * Quartz
+ === Web build tool ===
+  * NPM
+  * Grunt
+  * bower
+ === Web ===
+  * Angular JS
+  * jQuery
+  * Bootstrap
+  * D3 JS
+  * ACE
  == Cryptography ==
- Drill will eventually support encryption on the wire. This is not one of the initial goals,
and we do not expect Drill to be a controlled export item due to the use of encryption.
+ Kylin will eventually support encryption on the wire. This is not one of the initial goals,
and we do not expect Kylin to be a controlled export item due to the use of encryption.
+ Kylin supports but does not require the Kerberos authentication mechanism to access secured
Hadoop services.
  == Required Resources ==
  === Mailing List ===
-  * drill-private
-  * drill-dev
-  * drill-user
+  * kylin-private for private PMC discussions (with moderated subscriptions)
+  * kylin-dev
+  * kylin-commits
  === Subversion Directory ===
- Git is the preferred source control system: git://
+ Git is the preferred source control system: git://  
  === Issue Tracking ===
- JIRA Drill (DRILL)
+ JIRA Kylin (KYLIN)
+ === Other Resources ===
+ The existing code already has unit tests so we will make use of existing Apache continuous
testing infrastructure. The resulting load should not be very large.
  == Initial Committers ==
-  * Tomer Shiran <tshiran at maprtech dot com>
-  * Ted Dunning <tdunning at apache dot org>
-  * Jason Frantz <jfrantz at maprtech dot com>
-  * MC Srivas <mcsrivas at maprtech dot com>
-  * Chris Wensel <chris and concurrentinc dot com>
-  * Keys Botzum <kbotzum at maprtech dot com>
-  * Gera Shegalov <gshegalov at maprtech dot com>
-  * Ryan Rawson <ryan at drawntoscale dot com>
+  * Jiang Xu <jiangxu.china at gmail dot com>
+  * Luke Han <lukhan at ebay dot com>
+  * Yang Li <yangli9 at ebay dot com>
+  * George Song <ysong1 at ebay dot com>
+  * Hongbin Ma <honma at ebay dot com>
+  * Xiaodong Duo <oranjedog at gmail dot com>
+  * Julian Hyde <jhyde at apache dot org>
+  * Ankur Bansal <abansal at ebay dot com>
  == Affiliations ==
- The initial committers are employees of MapR Technologies, Drawn to Scale and Concurrent.
The nominated mentors are employees of MapR Technologies, Lucid Imagination and Nokia.
+ The initial committers are employees of eBay Inc., Ctrip and Hortonworks. The nominated
mentors are employees of Hortonworks, MapR Technologies and Pivotal.
  == Sponsors ==
  === Champion ===
+  * Owen O’Malley < omalley at apache dot org >
- Ted Dunning (tdunning at apache dot org)
+  * Ted Dunning <tdunning at apache dot org> 
  === Nominated Mentors ===
-  * Ted Dunning <tdunning at apache dot org> – Chief Application Architect at MapR
Technologies, Committer for Lucene, Mahout and ZooKeeper.
-  * Grant Ingersoll <grant at lucidimagination dot com> – Chief Scientist at Lucid
Imagination, Committer for Lucene, Mahout and other projects.
-  * Isabel Drost <isabel at apache dot org> – Software Developer at Nokia Gate 5
GmbH, Committer for Lucene, Mahout and other projects.
+  * Owen O’Malley < omalley at apache dot org > - Apache IPMC member, Co-founder
and Senior Architect,  Hortonworks
+  * Ted Dunning < tdunning at apache dot org> - Apache IPMC member, Chief Architect,
MapR Technologies
+  * Henry Saputra <hsaputra at apache dot org> - Apache IPMC member, Pivotal
+  * Jacques Nadeau <jacques at apache dot org> (pending admission to IPMC) - Apache
Drill PMC Chair, MapR Technologies
  === Sponsoring Entity ===
- Incubator
+ We are requesting the Incubator to sponsor this project.
