incubator-general mailing list archives

From Jean-Baptiste Onofré ...@nanthrax.net>
Subject Re: [VOTE] Accept Tajo into the Apache Incubator
Date Thu, 28 Feb 2013 20:14:37 GMT
+1 (binding)

Regards
JB

On 02/28/2013 09:13 PM, Henry Saputra wrote:
> +1 (non-binding)
>
>
> - Henry
>
>
> On Thu, Feb 28, 2013 at 10:11 AM, Hyunsik Choi <hyunsik@apache.org> wrote:
>
>> Hi Folks,
>>
>> I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
>> The vote will close on Mar 7 at 6:00 PM (PST).
>>
>> [] +1 Accept Tajo into the Apache incubator
>> [] +0 Don't care.
>> [] -1 Don't accept Tajo into the incubator because...
>>
>> The full proposal is pasted at the bottom of this email, and the
>> corresponding wiki page is http://wiki.apache.org/incubator/TajoProposal.
>>
>> Only VOTEs from Incubator PMC members are binding, but all are welcome to
>> express their thoughts.
>>
>> Thanks,
>> Hyunsik
>>
>> PS: Since the initial discussion, the main change is that I've added 4 new
>> committers. I've also revised parts of the Known Risks section because the
>> initial committers are now more diverse.
>>
>> ----------------
>> Tajo Proposal
>>
>> = Abstract =
>>
>> Tajo is a distributed data warehouse system for Hadoop.
>>
>>
>> = Proposal =
>>
>> Tajo is a relational and distributed data warehouse system for Hadoop. Tajo
>> is designed for low-latency and scalable ad-hoc queries, online aggregation
>> and ETL on large data sets by leveraging advanced database techniques. It
>> supports SQL standards. Tajo is inspired by Dryad, MapReduce, Dremel,
>> Scope, and parallel databases. Tajo uses HDFS as a primary storage layer,
>> and it has its own query engine which allows direct control of distributed
>> execution and data flow. As a result, Tajo has a variety of query
>> evaluation strategies and more optimization opportunities. In addition,
>> Tajo will have a native columnar execution engine and its optimizer. Tajo
>> will be an alternative to Hive/Pig on top of MapReduce.
>>
>>
>> = Background =
>>
>> Big data analysis has gained much attention in industry. Open source
>> communities have proposed scalable and distributed solutions for ad-hoc
>> queries on big data. However, there is still room for improvement. Markets
>> need faster and more efficient solutions. Recently, some alternatives
>> (e.g., Cloudera's Impala and Amazon Redshift) have come out.
>>
>>
>> = Rationale =
>>
>> There are a variety of open source distributed execution engines (e.g.,
>> Hive and Pig) running on top of MapReduce. They are limited by the MR
>> framework: they cannot directly control distributed execution and data
>> flow, because they simply use the MR framework. As a result, they have
>> limited query evaluation strategies and optimization opportunities, and it
>> is hard to optimize them for a certain type of data processing.
>>
>>
>> = Initial Goals =
>>
>> The initial goal is to write more documents describing Tajo's internals.
>> This will help us recruit more committers and build a solid community.
>> Then, we will set milestones for short- and long-term plans.
>>
>>
>> = Current Status =
>>
>> Tajo is in the alpha stage. Users can execute usual SQL queries (e.g.,
>> selection, projection, group-by, join, union and sort) except for nested
>> queries. Tajo provides various row/column storage formats, such as CSV,
>> RowFile (a row-store file we have implemented), RCFile, and Trevni, and it
>> also has a rudimentary ETL feature to transform one data format into
>> another. In addition, Tajo provides hash and range repartitioning. By
>> using both repartition methods, Tajo processes aggregation, join, and sort
>> queries over a number of cluster nodes. To evaluate the performance, we
>> have carried out a benchmark test using TPC-H 1TB on 32 cluster nodes.
>>
>>
>> == Meritocracy ==
>>
>> We will discuss the milestone and the future plan in an open forum. We plan
>> to encourage an environment that supports a meritocracy. The contributors
>> will have different privileges according to their contributions.
>>
>>
>> == Community ==
>>
>> Big data analysis has gained attention from open source communities,
>> industry, and academia. Some projects related to Hadoop already have
>> very large and active communities. We expect that Tajo will also establish
>> an active community. Since some Tajo features already work and the project
>> is in the alpha stage, we expect it to attract a large community soon.
>>
>>
>> == Core Developers ==
>>
>> Core developers are a diverse group of developers, many of whom are very
>> experienced in open source and the Apache Hadoop ecosystem.
>>
>>   * Eli Reisman <ereisman AT apache DOT org>
>>
>>   * Henry Saputra <hsaputra AT apache DOT org>
>>
>>   * Hyunsik Choi <hyunsik AT apache DOT org>
>>
>>   * Jae Hwa Jung <jhjung AT gruter DOT com>
>>
>>   * Jihoon Son <ghoonson AT gmail DOT com>
>>
>>   * Jin Ho Kim <jhkim AT gruter DOT com>
>>
>>   * Roshan Sumbaly <rsumbaly AT gmail DOT com>
>>
>>   * Sangwook Kim <swkim AT inervit DOT com>
>>
>>   * Yi A Liu <yi DOT a DOT liu AT intel DOT com>
>>
>>
>> == Alignment ==
>>
>> Tajo employs Apache Hadoop YARN as a resource management platform for large
>> clusters. It uses HDFS as a primary storage layer. It already supports
>> Hadoop-related data formats (RCFile, Trevni) and will support ORC files. In
>> addition, we plan to integrate Tajo with other products in the Hadoop
>> ecosystem. Tajo's modules are well organized, and these modules can also be
>> used by other projects.
>>
>>
>> = Known Risks =
>>
>> == Orphaned Products ==
>>
>> Most of the code has been developed by only two core developers, Hyunsik
>> Choi and Jihoon Son, so there may be a risk of being orphaned. However,
>> they are guaranteed to have enough time to develop Tajo for years. As you
>> can see from the commit history, they have participated in this project for
>> about two years. In addition, the initial committers are diverse, and Tajo
>> has been supported by two IT companies in South Korea. So, the risk of
>> being orphaned is very low. Later, we will be eager to recruit additional
>> committers in order to eliminate this risk.
>>
>>
>> == Inexperience with Open Source ==
>>
>> Most of the initial committers have experience working on open source
>> projects. In particular, Eli, Henry, and Hyunsik have experience as
>> committers and PMC members on other Apache projects.
>>
>>
>> == Homogeneous Developers ==
>>
>> Although they are a diverse group of developers, the fact that half of the
>> core developers are in South Korea may be a risk, because their offline
>> activities are limited by their location. Since we recognize this risk, we
>> will write more complete documents and presentation materials in order to
>> disseminate Tajo's internals and user guide. In addition, to mitigate this
>> risk we will be eager to recruit additional committers around the world.
>>
>>
>> == Reliance on Salaried Developers ==
>>
>> It is expected that Tajo development will occur on both salaried time and
>> on volunteer time. Hyunsik and Jihoon belong to the Database Lab at Korea
>> University. They will be paid by the lab to contribute to Tajo for years.
>> Jin Ho and
>> Sangwook are paid by their employer to contribute to this project. Other
>> developers will contribute to this project on volunteer time. In addition,
>> we will be eager to recruit additional committers including salaried and
>> non-salaried developers.
>>
>>
>> == Relationships with Other Apache Products ==
>>
>> Tajo has some overlapping functionality with Apache Incubator Drill.
>> However, Tajo is more mature than Drill. In addition, there are some
>> significant differences. Drill is a distributed system specialized for
>> low-latency query processing by using column operations and intermediate
>> data streaming. Drill has a very simple query optimizer. However, some
>> queries, including big-big table joins and sorts, are not available in that
>> manner. Drill will support only some query types.
>>
>> In contrast, Tajo has an advanced query optimization system. Tajo mainly
>> aims at scalable and efficient processing of all query types. By using the
>> query optimizer, Tajo will pursue low-latency query processing only for the
>> query types that can be executed in an online aggregation manner.
>>
>> In addition, Tez has some overlapping functions with Tajo. However, Tez is
>> in the pre-alpha stage and may be a prototype. When Tez becomes feasible,
>> Tajo could use Tez as an underlying framework where applicable. However,
>> Tajo will still use its row/native columnar execution engine and its
>> optimizer. Tajo may potentially be the first application of Tez.
>>
>>
>> == An Excessive Fascination with the Apache Brand ==
>>
>> We believe that the Apache brand will help us to find contributors and to
>> grow the community. The community and development process will make this
>> project more stable and help establish ubiquitous APIs. In addition, Tajo
>> depends on other projects in the Apache Hadoop ecosystem. We expect that
>> cooperative work will occur with other projects in the same place.
>>
>>
>> = Documentation =
>>
>> Tajo's demonstration paper was accepted to IEEE ICDE 2013. Since this
>> conference will be held in April 2013, we cannot publicly show the paper
>> yet. Instead, we have attached some presentation material. Check out the
>> slides (http://www.slideshare.net/hyunsikchoi/tajo-intro).
>>
>> In addition, some documents (e.g., getting started) are available at
>> http://tajo-project.github.com/tajo/.
>>
>>
>> = Initial Source =
>>
>> The initial source code has been developed in the Database Lab at Korea
>> University. It is implemented in Java and has almost 100,000 lines,
>> excluding parser and protobuf-generated code. The initial source code is
>> already available on GitHub at [[https://github.com/tajo-project/tajo]].
>>
>>
>> = Source and Intellectual Property Submission Plan =
>>
>> We intend the entire code base to be licensed under the Apache License,
>> Version 2.0.
>>
>>
>> = External Dependencies =
>>
>> The required dependencies all have Apache-compatible licenses. The
>> components with non-Apache licenses are enumerated below:
>>
>>   * Google Guava
>>
>>   * Google Protocol Buffer
>>
>>   * Antlr
>>
>>   * Mockito
>>
>>   * JLine2
>>
>>
>> = Cryptography =
>>
>>   Tajo will depend on secure Hadoop that can optionally use Kerberos.
>>
>>
>> = Required Resources =
>>
>> == Mailing List ==
>>
>>   * tajo-private (with moderated subscriptions)
>>
>>   * tajo-dev
>>
>>   * tajo-commits
>>
>>
>> == Subversion Directory ==
>>
>> https://git-wip-us.apache.org/repos/asf/tajo.git
>>
>>
>> == Issue Tracking ==
>>
>> Jira Tajo (TAJO)
>>
>>
>> == Other Resources ==
>>
>>   * Continuous Integration
>>
>>     * Jenkins
>>
>>   * Wiki
>>
>>     * http://wiki.apache.org/tajo
>>
>>
>> = Initial Committers =
>>
>>   * Eli Reisman <ereisman AT apache DOT org>
>>
>>   * Henry Saputra <hsaputra AT apache DOT org>
>>
>>   * Hyunsik Choi <hyunsik AT apache DOT org>
>>
>>   * Jae Hwa Jung <jhjung AT gruter DOT com>
>>
>>   * Jihoon Son <ghoonson AT gmail DOT com>
>>
>>   * Jin Ho Kim <jhkim AT gruter DOT com>
>>
>>   * Roshan Sumbaly <rsumbaly AT gmail DOT com>
>>
>>   * Sangwook Kim <swkim AT inervit DOT com>
>>
>>   * Yi A Liu <yi DOT a DOT liu AT intel DOT com>
>>
>>
>> = Affiliations =
>>
>>   * Eli Reisman (Hortonworks)
>>
>>   * Henry Saputra (Platfora)
>>
>>   * Hyunsik Choi (Database Lab., Korea University)
>>
>>   * Jae Hwa Jung (Gruter)
>>
>>   * Jihoon Son (Database Lab., Korea University)
>>
>>   * Jin Ho Kim (Gruter)
>>
>>   * Roshan Sumbaly (LinkedIn)
>>
>>   * Sangwook Kim (Inervit)
>>
>>   * Yi A Liu (Intel)
>>
>>
>> The nominated mentors are employees of NASA JPL, LinkedIn, and Hortonworks.
>>
>>   * Chris Mattmann - NASA JPL
>>
>>   * Jakob Homan - LinkedIn
>>
>>   * Owen O'Malley - Hortonworks
>>
>>
>> = Sponsors =
>>
>> == Champion ==
>>
>>   * Jakob Homan <jghoman AT apache DOT org>
>>
>>
>> == Nominated Mentors ==
>>
>>   * Chris Mattmann <chris DOT a DOT mattmann AT jpl DOT nasa DOT gov>
>>
>>   * Jakob Homan <jghoman AT apache DOT org>
>>
>>   * Owen O'Malley <omalley AT apache DOT org>
>>
>>
>> == Sponsoring Entity ==
>>
>> Apache Incubator
>>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org