incubator-general mailing list archives

From Alan Cabrera <l...@toolazydogs.com>
Subject Re: [VOTE] Accept Tajo into the Apache Incubator
Date Thu, 28 Feb 2013 21:24:15 GMT
+1

Regards,
Alan

On Feb 28, 2013, at 10:11 AM, Hyunsik Choi wrote:

> Hi Folks,
> 
> I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
> The vote will close on Mar 7 at 6:00 PM (PST).
> 
> [] +1 Accept Tajo into the Apache incubator
> [] +0 Don't care.
> [] -1 Don't accept Tajo into the incubator because...
> 
> The full proposal is pasted at the bottom of this email, and the corresponding
> wiki is http://wiki.apache.org/incubator/TajoProposal.
> 
> Only VOTEs from Incubator PMC members are binding, but all are welcome to
> express their thoughts.
> 
> Thanks,
> Hyunsik
> 
> PS: Since the initial discussion, the main change is that I've added 4 new
> committers. I've also revised parts of the Known Risks section because the
> initial committers are now more diverse.
> 
> ----------------
> Tajo Proposal
> 
> = Abstract =
> 
> Tajo is a distributed data warehouse system for Hadoop.
> 
> 
> = Proposal =
> 
> Tajo is a relational and distributed data warehouse system for Hadoop. Tajo
> is designed for low-latency and scalable ad-hoc queries, online aggregation
> and ETL on large data sets by leveraging advanced database techniques. It
> supports SQL standards. Tajo is inspired by Dryad, MapReduce, Dremel,
> Scope, and parallel databases. Tajo uses HDFS as a primary storage layer,
> and it has its own query engine which allows direct control of distributed
> execution and data flow. As a result, Tajo has a variety of query
> evaluation strategies and more optimization opportunities. In addition,
> Tajo will have a native columnar execution engine and its own optimizer.
> Tajo will be an alternative to Hive/Pig on top of MapReduce.
> 
> 
> = Background =
> 
> Big data analysis has gained much attention in industry. Open source
> communities have proposed scalable and distributed solutions for ad-hoc
> queries on big data. However, there is still room for improvement. The
> market needs faster and more efficient solutions. Recently, some
> alternatives (e.g., Cloudera's Impala and Amazon Redshift) have come out.
> 
> 
> = Rationale =
> 
> There are a variety of open source distributed execution engines (e.g.,
> Hive and Pig) running on top of MapReduce. They are limited by the MR
> framework: because they simply run on it, they cannot directly control
> distributed execution and data flow. As a result, they have limited query
> evaluation strategies and optimization opportunities, and it is hard to
> optimize them for a particular type of data processing.
> 
> 
> = Initial Goals =
> 
> The initial goal is to write more documents describing Tajo's internals,
> which will help us recruit more committers and build a solid community.
> Then, we will set milestones for short- and long-term plans.
> 
> 
> = Current Status =
> 
> Tajo is in the alpha stage. Users can execute usual SQL queries (e.g.,
> selection, projection, group-by, join, union and sort) except for nested
> queries. Tajo provides various row/column storage formats, such as CSV,
> RowFile (a row-store file we have implemented), RCFile, and Trevni, and it
> also has a rudimentary ETL feature to transform one data format to another
> data format. In addition, Tajo provides hash and range repartitions. By
> using both repartition methods, Tajo processes aggregation, join, and sort
> queries over a number of cluster nodes. To evaluate the performance, we
> have carried out a benchmark test using TPC-H 1TB on 32 cluster nodes.
> 
> 
> == Meritocracy ==
> 
> We will discuss the milestone and the future plan in an open forum. We plan
> to encourage an environment that supports a meritocracy. The contributors
> will have different privileges according to their contributions.
> 
> 
> == Community ==
> 
> Big data analysis has gained attention from open source communities,
> industry, and academia. Some projects related to Hadoop already have
> very large and active communities. We expect that Tajo will also establish
> an active community. Since Tajo already has working features and is in
> the alpha stage, it will attract a large community soon.
> 
> 
> == Core Developers ==
> 
> The core developers are a diverse group, many of whom are very
> experienced in open source and the Apache Hadoop ecosystem.
> 
> * Eli Reisman <ereisman AT apache DOT org>
> 
> * Henry Saputra <hsaputra AT apache DOT org>
> 
> * Hyunsik Choi <hyunsik AT apache DOT org>
> 
> * Jae Hwa Jung <jhjung AT gruter DOT com>
> 
> * Jihoon Son <ghoonson AT gmail DOT com>
> 
> * Jin Ho Kim <jhkim AT gruter DOT com>
> 
> * Roshan Sumbaly <rsumbaly AT gmail DOT com>
> 
> * Sangwook Kim <swkim AT inervit DOT com>
> 
> * Yi A Liu <yi DOT a DOT liu AT intel DOT com>
> 
> 
> == Alignment ==
> 
> Tajo employs Apache Hadoop YARN as a resource management platform for large
> clusters. It uses HDFS as a primary storage layer. It already supports
> Hadoop-related data formats (RCFile, Trevni) and will support ORC files. In
> addition, we plan to integrate Tajo with other products in the Hadoop
> ecosystem. Tajo's modules are well organized, and these modules can also be
> used by other projects.
> 
> 
> = Known Risks =
> 
> == Orphaned Products ==
> 
> Most of the code has been developed by only two core developers, Hyunsik
> Choi and Jihoon Son, so there may be a risk of the project being orphaned.
> However, they are guaranteed to have enough time to develop Tajo for years.
> As the commit history shows, they have participated in this project for
> about two years. In addition, the initial committers are diverse, and Tajo
> has been supported by two IT companies in South Korea. So, the risk of
> being orphaned is very low. Later, we will actively recruit additional
> committers in order to eliminate this risk.
> 
> 
> == Inexperience with Open Source ==
> 
> Most of the initial committers have experience working on open source
> projects. In particular, Eli, Henry, and Hyunsik have experience as
> committers and PMC members on other Apache projects.
> 
> 
> == Homogeneous Developers ==
> 
> Although the core developers are a diverse group, the fact that half of
> them are in South Korea may be a risk, because their offline activities
> are limited by their location. Since we recognize this risk, we will write
> more complete documents and presentation materials in order to disseminate
> Tajo's internals and user guide. In addition, to mitigate this risk, we
> will actively recruit additional committers around the world.
> 
> 
> == Reliance on Salaried Developers ==
> 
> It is expected that Tajo development will occur on both salaried time and
> volunteer time. Hyunsik and Jihoon belong to the Database Lab., Korea Univ.
> They will be paid by the lab to contribute to Tajo for years. Jin Ho and
> Sangwook are paid by their employers to contribute to this project. Other
> developers will contribute to this project on volunteer time. In addition,
> we will actively recruit additional committers, including both salaried and
> non-salaried developers.
> 
> 
> == Relationships with Other Apache Products ==
> 
> Tajo has some overlapping functions with Apache Incubator Drill. However,
> Tajo is more mature than Drill. In addition, there are some
> significant differences. Drill is a distributed system specialized for
> low-latency query processing by using column operations and intermediate
> data streaming. Drill has a very simple query optimizer. However, some
> queries, including big-big table joins and sorts, are not available in that
> manner. Drill will support some query types.
> 
> In contrast, Tajo has an advanced query optimization system. Tajo mainly
> aims at scalable and efficient processing of all query types. Using the
> query optimizer, Tajo will pursue low-latency query processing only for
> those query types that can be executed in an online aggregation manner.
> 
> Tez also has some overlapping functions with Tajo. However, Tez is in
> the pre-alpha stage and may be a prototype. When Tez becomes feasible, Tajo
> could use Tez as an underlying framework where applicable.
> However, Tajo will still use its row/native columnar execution engine and
> its optimizer. Tajo could potentially be the first application of Tez.
> 
> 
> == An Excessive Fascination with the Apache Brand ==
> 
> We believe that the Apache brand will help us to find contributors and to
> grow the community. The community and development process will make this
> project more stable and help establish ubiquitous APIs. In addition, Tajo
> depends on other projects in the Apache Hadoop ecosystem. We expect
> cooperative work with those projects to happen in the same place.
> 
> 
> = Documentation =
> 
> Tajo's demonstration paper was accepted to IEEE ICDE 2013. Since this
> conference will be held in April 2013, we cannot publicly show the paper.
> Instead, we have attached some presentation material. Check out these slides:
> http://www.slideshare.net/hyunsikchoi/tajo-intro
> 
> In addition, some documents (e.g., getting started) are available at
> http://tajo-project.github.com/tajo/.
> 
> 
> = Initial Source =
> 
> The initial source code has been developed in the Database Lab., Korea Univ.
> It is implemented in Java and has almost 100,000 lines, excluding parser
> and protobuf-generated code. The initial source code is already
> available on GitHub at [[https://github.com/tajo-project/tajo]].
> 
> 
> = Source and Intellectual Property Submission Plan =
> 
> We intend the entire code base to be licensed under the Apache License,
> Version 2.0.
> 
> 
> = External Dependencies =
> 
> All required dependencies have Apache-compatible licenses. The components
> with non-Apache licenses are:
> 
> * Google Guava
> 
> * Google Protocol Buffer
> 
> * Antlr
> 
> * Mockito
> 
> * JLine2
> 
> 
> = Cryptography =
> 
> Tajo will depend on secure Hadoop that can optionally use Kerberos.
> 
> 
> = Required Resources =
> 
> == Mailing Lists ==
> 
> * tajo-private (with moderated subscriptions)
> 
> * tajo-dev
> 
> * tajo-commits
> 
> 
> == Subversion Directory ==
> 
> https://git-wip-us.apache.org/repos/asf/tajo.git
> 
> 
> == Issue Tracking ==
> 
> Jira Tajo (TAJO)
> 
> 
> == Other Resources ==
> 
> * Continuous Integration
> 
>   * Jenkins
> 
> * Wiki
> 
>   * http://wiki.apache.org/tajo
> 
> 
> = Initial Committers =
> 
> * Eli Reisman <ereisman AT apache DOT org>
> 
> * Henry Saputra <hsaputra AT apache DOT org>
> 
> * Hyunsik Choi <hyunsik AT apache DOT org>
> 
> * Jae Hwa Jung <jhjung AT gruter DOT com>
> 
> * Jihoon Son <ghoonson AT gmail DOT com>
> 
> * Jin Ho Kim <jhkim AT gruter DOT com>
> 
> * Roshan Sumbaly <rsumbaly AT gmail DOT com>
> 
> * Sangwook Kim <swkim AT inervit DOT com>
> 
> * Yi A Liu <yi DOT a DOT liu AT intel DOT com>
> 
> 
> = Affiliations =
> 
> * Eli Reisman (Hortonworks)
> 
> * Henry Saputra (Platfora)
> 
> * Hyunsik Choi (Database Lab., Korea University)
> 
> * Jae Hwa Jung (Gruter)
> 
> * Jihoon Son (Database Lab., Korea University)
> 
> * Jin Ho Kim (Gruter)
> 
> * Roshan Sumbaly (LinkedIn)
> 
> * Sangwook Kim (Inervit)
> 
> * Yi A Liu (Intel)
> 
> 
> The nominated mentors are employees of NASA JPL, LinkedIn, and Hortonworks.
> 
> * Chris Mattmann - NASA JPL
> 
> * Jakob Homan - LinkedIn
> 
> * Owen O'Malley - Hortonworks
> 
> 
> = Sponsors =
> 
> == Champion ==
> 
> * Jakob Homan <jghoman AT apache DOT org>
> 
> 
> == Nominated Mentors ==
> 
> * Chris Mattmann <chris DOT a DOT mattmann AT jpl DOT nasa DOT gov>
> 
> * Jakob Homan <jghoman AT apache DOT org>
> 
> * Owen O'Malley <omalley AT apache DOT org>
> 
> 
> == Sponsoring Entity ==
> 
> Apache Incubator



