incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "TajoProposal" by HyunsikChoi
Date Fri, 22 Feb 2013 14:58:55 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "TajoProposal" page has been changed by HyunsikChoi:
http://wiki.apache.org/incubator/TajoProposal

New page:
= Abstract =

Tajo is a distributed data warehouse system for Hadoop.

= Proposal =
Tajo is a relational and distributed data warehouse system for Hadoop. Tajo is designed for
low-latency and scalable ad-hoc queries, online aggregation, and ETL on large data sets by
leveraging advanced database techniques. It supports SQL standards. Tajo is inspired by Dryad,
MapReduce, Dremel, Scope, and parallel databases. It has its own query engine, which allows
direct control of distributed execution and data flow. As a result, Tajo has a variety of
query evaluation strategies and more optimization opportunities. In addition, Tajo will have
a native columnar execution engine and its own optimizer. Tajo will be an alternative to Hive/Pig
running on top of MapReduce.

= Background =
Big data analysis has gained much attention in industry. Open source communities have
proposed scalable and distributed solutions for ad-hoc queries on big data. However, there
is still room for improvement. The market needs faster and more efficient solutions. Recently,
some alternatives (e.g., Cloudera's Impala and Amazon Redshift) have come out.

= Rationale =
There are a variety of open source distributed execution engines (e.g., Hive and Pig) running
on top of MapReduce. They are limited by the MR framework: they cannot directly control distributed
execution and data flow, because they rely entirely on the MR framework. As a result, they have
limited query evaluation strategies and optimization opportunities, and it is hard to optimize
them for a particular type of data processing.

= Initial Goals =

The initial goal is to write more documentation describing Tajo's internals. It will help
to recruit more committers and to build a solid community. Then, we will set milestones for
short- and long-term plans.

= Current Status =

Tajo is in the alpha stage. Users can submit standard SQL queries, except for nested queries. The
queries are executed across a number of cluster nodes. We have carried out benchmark tests using
TPC-H 1TB on 32 cluster nodes. Tajo already supports various row/column file formats, such
as CSV, RowFile (which we developed), RCFile, and Trevni.

== Meritocracy ==

We will discuss milestones and future plans in an open forum. We plan to encourage an
environment that supports a meritocracy. Contributors will have different privileges according
to their contributions.

== Community ==
Big data analysis has gained attention from open source communities, industry, and academia.
Some projects related to Hadoop already have very large and active communities. We
expect that Tajo will also establish an active community. Since Tajo is more mature
than other projects that aim at low-latency query processing and is already in the alpha stage, we
expect it to attract a large community soon.

== Core Developers ==
The core developers are very experienced in the Apache Hadoop ecosystem. To increase developer
diversity, we are eager to recruit developers from a range of companies.

 * Hyunsik Choi <hyunsik at apache dot org>
 * Jihoon Son <ghoonson at gmail dot com>
 * Jin Ho Kim <jhkim at gruter dot com>
 * Sangwook Kim <swkim at inervit dot com>

== Alignment ==
Tajo employs Apache Hadoop YARN as a resource management platform for large clusters. It uses
HDFS as its primary storage layer. It already supports Hadoop-related data formats (RCFile,
Trevni) and will support ORC files. In addition, we plan to integrate Tajo with other
products in the Hadoop ecosystem. Tajo's modules are well organized, and these modules can also
be used by other projects.

= Known Risks =

== Orphaned Products ==
Most of the code has been developed by two core developers, Hyunsik Choi and Jihoon
Son. However, they are guaranteed to have enough time to develop Tajo for years. As the
commit history shows, they have participated in this project for about two years. Recently,
Tajo has been supported by two IT companies in Korea. In addition, we are eager to recruit
additional committers in order to mitigate this risk.

== Inexperience with Open Source ==
Most of the initial committers have experience working on open source projects. Hyunsik Choi
has experience as a committer and PMC member on other Apache projects.

== Homogeneous Developers ==
Although the core developers have three different affiliations, the fact that they are all in
South Korea is a risk, because their location limits their offline activities. Since
we recognize this risk, we will write more complete documents and presentation materials
as early as possible. Then, we will actively recruit additional committers around the world.

== Reliance on Salaried Developers ==
Hyunsik Choi and Jihoon Son belong to the Database Lab., Korea Univ. They will be paid by the
lab to contribute to Tajo for two years. Other core developers are paid by their employers to
contribute to this project. In addition, we are eager to recruit additional committers,
including both salaried and non-salaried developers.

== Relationships with Other Apache Products ==
Tajo has some overlapping functionality with Apache Incubator Drill. However, it is more mature
than Drill. In addition, there are some significant differences. Drill is a distributed system
specialized for low-latency query processing using columnar operations and streaming of intermediate
data. Drill has a very simple query optimizer. However, some queries, including joins between two
big tables and sorts, cannot be processed in that manner. Drill will support only some query types.

In contrast, Tajo has an advanced query optimization system. Tajo mainly aims at scalable and
efficient processing of all query types. Using the query optimizer, Tajo will pursue
low-latency query processing only for those query types that can be executed in an online
aggregation manner.
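
As a rough illustration of what executing a query "in an online aggregation manner" means, the following Python sketch refines an AVG estimate as more rows are processed instead of waiting for the full scan. It is not Tajo code: the function name `online_avg`, the batch size, and the generated input are all hypothetical, and a real engine would additionally read rows in random order and report a confidence interval with each estimate.

```python
import random
from typing import Iterable, Iterator, Tuple


def online_avg(values: Iterable[float], batch_size: int) -> Iterator[Tuple[int, float]]:
    """Yield (rows_seen, running_average) after each batch of input rows.

    Hypothetical helper for illustration only; it does not reflect how
    Tajo implements online aggregation.
    """
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
        # Report an intermediate estimate once per full batch.
        if count % batch_size == 0:
            yield count, total / count
    # Emit the exact final answer if the last batch was partial.
    if count and count % batch_size != 0:
        yield count, total / count


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs see the same data
    prices = [rng.uniform(1.0, 100.0) for _ in range(10_000)]
    for rows_seen, estimate in online_avg(prices, batch_size=2_500):
        print(f"after {rows_seen} rows: AVG ~= {estimate:.2f}")
```

Aggregates such as AVG, SUM, and COUNT fit this pattern because a partial result is already meaningful. A join between two big tables or a global sort does not, which is the distinction the comparison above relies on.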

In addition, Tez, which is being voted on at incubator-general@a.o, has some overlapping functionality
with Tajo. However, Tez is in the pre-alpha stage and may be a prototype. When Tez becomes feasible, Tajo
could use Tez as an underlying framework where applicable. However, Tajo will
still use its row/native columnar execution engine and its optimizer. Tajo may potentially be
the first application of Tez.

== An Excessive Fascination with the Apache Brand ==
We believe that the Apache brand will help us to find contributors and to grow the community.
The community and development process will make this project more stable and its APIs more widely used.
In addition, Tajo depends on other projects in the Apache Hadoop ecosystem. We expect that being
in the same place will encourage cooperative work with those projects.

= Documentation =
Tajo's demonstration paper was accepted to IEEE ICDE 2013. Since this conference will be held
in April 2013, we cannot yet publicly show the paper. Instead, we have attached some presentation material.
Check out these [[http://dbserver.korea.ac.kr/~hyunsik/files/tajo_intro.pdf|slides]].

In addition, some documents (e.g., getting started) are available at [[http://tajo-project.github.com/tajo/]].

= Initial Source =
The initial source code was developed in the Database Lab., Korea Univ. It is implemented
in Java and has almost 100,000 lines, excluding parser and protobuf-generated code. The
initial source code is already available on GitHub at [[https://github.com/tajo-project/tajo]].

= Source and Intellectual Property Submission Plan =

We intend the entire code base to be licensed under the Apache License, Version 2.0.

= External Dependencies =
All required dependencies have Apache-compatible licenses. The following components have
non-Apache licenses:

 * Google Guava
 * Google Protocol Buffer
 * Antlr
 * Mockito
 * JLine2

= Cryptography =
 
Tajo will depend on secure Hadoop that can optionally use Kerberos.

= Required Resources =
== Mailing Lists ==
 * tajo-private (with moderated subscriptions)
 * tajo-dev
 * tajo-commits
 * tajo-user

== Git Repository ==
https://git-wip-us.apache.org/repos/asf/tajo.git

== Issue Tracking ==
Jira Tajo (TAJO)

== Other Resources ==
 * Continuous Integration
  * Jenkins
 * Wiki
  * http://cwiki.apache.org

= Initial Committers =
 * Hyunsik Choi <hyunsik at apache dot org>
 * Jihoon Son <ghoonson at gmail dot com>
 * Jin Ho Kim <jhkim at gruter dot com>
 * Sangwook Kim <swkim at inervit dot com>

= Affiliations =
 * Hyunsik Choi (Database Lab. Korea University)
 * Jihoon Son (Database Lab. Korea University)
 * Jin Ho Kim (Gruter)
 * Sangwook Kim (Inervit)

= Sponsors =

== Champion ==

 * Jakob Homan <ghoman at apache dot org>

== Nominated Mentors ==

 * Owen O'Malley <omalley at apache dot org>, Architect at Hortonworks. Committer for Hadoop, Ambari.

== Sponsoring Entity ==
Apache Incubator

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org

