From: "Mattmann, Chris A (388J)"
To: "general@incubator.apache.org"
Subject: Re: [VOTE] Accept Tajo into the Apache Incubator
Date: Mon, 4 Mar 2013 16:59:41 +0000

+1 (binding) from me.

Cheers,
Chris

On 2/28/13 10:11 AM, "Hyunsik Choi" wrote:

>Hi Folks,
>
>I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
>The vote will close on Mar 7 at 6:00 PM (PST).
>
>[] +1 Accept Tajo into the Apache incubator
>[] +0 Don't care.
>[] -1 Don't accept Tajo into the incubator because...
>
>The full proposal is pasted at the bottom of this email, and the
>corresponding wiki page is http://wiki.apache.org/incubator/TajoProposal.
>
>Only VOTEs from Incubator PMC members are binding, but all are welcome to
>express their thoughts.
>
>Thanks,
>Hyunsik
>
>PS: Since the initial discussion, the main changes are that I've added 4
>new committers. Also, I've revised the description of Known Risks because
>the initial committers are now diverse.
>
>----------------
>Tajo Proposal
>
>= Abstract =
>
>Tajo is a distributed data warehouse system for Hadoop.
>
>
>= Proposal =
>
>Tajo is a relational and distributed data warehouse system for Hadoop.
>Tajo is designed for low-latency and scalable ad-hoc queries, online
>aggregation, and ETL on large data sets by leveraging advanced database
>techniques. It supports SQL standards. Tajo is inspired by Dryad,
>MapReduce, Dremel, Scope, and parallel databases. Tajo uses HDFS as a
>primary storage layer, and it has its own query engine which allows direct
>control of distributed execution and data flow. As a result, Tajo has a
>variety of query evaluation strategies and more optimization opportunities.
>In addition, Tajo will have a native columnar execution engine and its own
>optimizer. Tajo will be an alternative to Hive/Pig on top of MapReduce.
>
>
>= Background =
>
>Big data analysis has gained much attention in industry. Open source
>communities have proposed scalable and distributed solutions for ad-hoc
>queries on big data. However, there is still room for improvement. Markets
>need faster and more efficient solutions. Recently, some alternatives
>(e.g., Cloudera's Impala and Amazon Redshift) have come out.
>
>
>= Rationale =
>
>There are a variety of open source distributed execution engines (e.g.,
>Hive and Pig) running on top of MapReduce. They are limited by the MR
>framework. They cannot directly control distributed execution and data
>flow, and they just use the MR framework. So, they have limited query
>evaluation strategies and optimization opportunities. It is hard for them
>to be optimized for a certain type of data processing.
>
>
>= Initial Goals =
>
>The initial goal is to write more documents describing Tajo's internals.
>This will help to recruit more committers and to build a solid community.
>Then, we will set milestones for short- and long-term plans.
>
>
>= Current Status =
>
>Tajo is in the alpha stage. Users can execute usual SQL queries (e.g.,
>selection, projection, group-by, join, union and sort) except for nested
>queries. Tajo provides various row/column storage formats, such as CSV,
>RowFile (a row-store file we have implemented), RCFile, and Trevni, and it
>also has a rudimentary ETL feature to transform one data format into
>another. In addition, Tajo provides hash and range repartitions. By
>using both repartition methods, Tajo processes aggregation, join, and sort
>queries over a number of cluster nodes. To evaluate the performance, we
>have carried out a benchmark test using TPC-H 1TB on 32 cluster nodes.
>
>
>== Meritocracy ==
>
>We will discuss the milestones and the future plan in an open forum. We
>plan to encourage an environment that supports a meritocracy. Contributors
>will have different privileges according to their contributions.
>
>
>== Community ==
>
>Big data analysis has gained attention from open source communities,
>industry, and academia. Some projects related to Hadoop already have
>very large and active communities. We expect that Tajo will also establish
>an active community. Since Tajo already works for some features and is in
>the alpha stage, it will attract a large community soon.
>
>
>== Core Developers ==
>
>The core developers are a diverse group, many of whom are very
>experienced in open source and the Apache Hadoop ecosystem.
>
> * Eli Reisman
> * Henry Saputra
> * Hyunsik Choi
> * Jae Hwa Jung
> * Jihoon Son
> * Jin Ho Kim
> * Roshan Sumbaly
> * Sangwook Kim
> * Yi A Liu
>
>
>== Alignment ==
>
>Tajo employs Apache Hadoop YARN as a resource management platform for
>large clusters. It uses HDFS as a primary storage layer. It already
>supports Hadoop-related data formats (RCFile, Trevni) and will support ORC
>files. In addition, we plan to integrate Tajo with other products of the
>Hadoop ecosystem. Tajo's modules are well organized, and these modules can
>also be used by other projects.
>
>
>= Known Risks =
>
>== Orphaned Products ==
>
>Most of the code has been developed by only two core developers, Hyunsik
>Choi and Jihoon Son, so there may be a risk of the project being orphaned.
>However, they are guaranteed to have enough time to develop Tajo for
>years. As the commit history shows, they have participated in this project
>for about two years. In addition, the initial committers are diverse, and
>Tajo has been supported by two IT companies in South Korea. So, the risk
>of being orphaned is very low.
>Later, we will be eager to recruit additional committers in order to
>eliminate this risk.
>
>
>== Inexperience with Open Source ==
>
>Most of the initial committers have experience working on open source
>projects. In particular, Eli, Henry, and Hyunsik have experience as
>committers and PMC members on other Apache projects.
>
>
>== Homogeneous Developers ==
>
>Although the core developers are a diverse group, the fact that half of
>them are in South Korea may be a risk, because their offline activities
>are limited by their location. Since we recognize this risk, we will
>write more complete documents and presentation materials in order to
>disseminate Tajo's internals and user guide. In addition, to mitigate
>this risk we will be eager to recruit additional committers around the
>world.
>
>
>== Reliance on Salaried Developers ==
>
>It is expected that Tajo development will occur on both salaried time and
>volunteer time. Hyunsik and Jihoon belong to the Database Lab., Korea
>Univ. They will be paid by the lab to contribute to Tajo for years. Jin Ho
>and Sangwook are paid by their employers to contribute to this project.
>Other developers will contribute to this project on volunteer time. In
>addition, we will be eager to recruit additional committers, including
>salaried and non-salaried developers.
>
>
>== Relationships with Other Apache Products ==
>
>Tajo has some overlapping functionality with Apache Incubator Drill.
>However, Tajo is more mature than Drill. In addition, there are some
>significant differences. Drill is a distributed system specialized for
>low-latency query processing by using column operations and intermediate
>data streaming. Drill has a very simple query optimizer. However, some
>queries, including big-big table joins and sorts, are not available in
>that manner. Drill will support some of the query types.
>
>In contrast, Tajo has an advanced query optimization system.
>Tajo mainly aims at scalable and efficient processing of all query types.
>By using the query optimizer, Tajo will pursue low-latency query
>processing only for those query types that can be executed in an online
>aggregation manner.
>
>Besides, Tez has some overlapping functions with Tajo. However, Tez is in
>the pre-alpha stage and may be a prototype. When Tez becomes feasible,
>Tajo could use Tez as an underlying framework where applicable.
>However, Tajo will still use its row/native columnar execution engine and
>its optimizer. Tajo may potentially be the first application of Tez.
>
>
>== An Excessive Fascination with the Apache Brand ==
>
>We believe that the Apache brand will help us to find contributors and to
>grow the community. The community and development process will make this
>project more stable and help establish ubiquitous APIs. In addition, Tajo
>depends on other projects in the Apache Hadoop ecosystem. We expect
>cooperative work with those projects to happen in the same place.
>
>
>= Documentation =
>
>Tajo's demonstration paper was accepted to IEEE ICDE 2013. Since this
>conference will be held in April 2013, we cannot publicly show the paper
>yet. Instead, we have attached some presentation material. Check out these
>slides (http://www.slideshare.net/hyunsikchoi/tajo-intro).
>
>In addition, some documents (e.g., getting started) are available at
>http://tajo-project.github.com/tajo/.
>
>
>= Initial Source =
>
>The initial source code has been developed in the Database Lab., Korea
>Univ. It is implemented in Java and has almost 100,000 lines, excluding
>parser and protobuf generated code. Currently, the initial source code is
>available on GitHub at [[https://github.com/tajo-project/tajo]].
>
>
>= Source and Intellectual Property Submission Plan =
>
>We intend the entire code base to be licensed under the Apache License,
>Version 2.0.
>
>
>= External Dependencies =
>
>The required dependencies all have Apache-compatible licenses. The
>following components with non-Apache licenses are enumerated:
>
> * Google Guava
> * Google Protocol Buffers
> * Antlr
> * Mockito
> * JLine2
>
>
>= Cryptography =
>
>Tajo will depend on secure Hadoop, which can optionally use Kerberos.
>
>
>= Required Resources =
>
>== Mailing Lists ==
>
> * tajo-private (with moderated subscriptions)
> * tajo-dev
> * tajo-commits
>
>
>== Subversion Directory ==
>
>https://git-wip-us.apache.org/repos/asf/tajo.git
>
>
>== Issue Tracking ==
>
>Jira Tajo (TAJO)
>
>
>== Other Resources ==
>
> * Continuous Integration
>   * Jenkins
> * Wiki
>   * http://wiki.apache.org/tajo
>
>
>= Initial Committers =
>
> * Eli Reisman
> * Henry Saputra
> * Hyunsik Choi
> * Jae Hwa Jung
> * Jihoon Son
> * Jin Ho Kim
> * Roshan Sumbaly
> * Sangwook Kim
> * Yi A Liu
>
>
>= Affiliations =
>
> * Eli Reisman (Hortonworks)
> * Henry Saputra (Platfora)
> * Hyunsik Choi (Database Lab., Korea University)
> * Jae Hwa Jung (Gruter)
> * Jihoon Son (Database Lab., Korea University)
> * Jin Ho Kim (Gruter)
> * Roshan Sumbaly (LinkedIn)
> * Sangwook Kim (Inervit)
> * Yi A Liu (Intel)
>
>
>The nominated mentors are employees of NASA JPL, LinkedIn, and
>Hortonworks.
>
> * Chris Mattmann - NASA JPL
> * Jakob Homan - LinkedIn
> * Owen O'Malley - Hortonworks
>
>
>= Sponsors =
>
>== Champion ==
>
> * Jakob Homan
>
>
>== Nominated Mentors ==
>
> * Chris Mattmann
> * Jakob Homan
> * Owen O'Malley
>
>
>== Sponsoring Entity ==
>
>Apache Incubator