incubator-general mailing list archives

From "Owen O'Malley" <omal...@apache.org>
Subject Re: [VOTE] Accept Tajo into the Apache Incubator
Date Fri, 01 Mar 2013 07:01:22 GMT
+1 (binding)


On Thu, Feb 28, 2013 at 10:34 PM, Suresh Marru <smarru@apache.org> wrote:

> + 1 (binding).
>
> Happy Incubating,
> Suresh
>
> On Feb 28, 2013, at 10:11 AM, Hyunsik Choi <hyunsik@apache.org> wrote:
>
> > Hi Folks,
> >
> > I'd like to call a VOTE for acceptance of Tajo into the Apache incubator.
> > The vote will close on Mar 7 at 6:00 PM (PST).
> >
> > [] +1 Accept Tajo into the Apache incubator
> > [] +0 Don't care.
> > [] -1 Don't accept Tajo into the incubator because...
> >
> > The full proposal is pasted at the bottom of this email, and the
> > corresponding wiki is http://wiki.apache.org/incubator/TajoProposal.
> >
> > Only VOTEs from Incubator PMC members are binding, but all are welcome to
> > express their thoughts.
> >
> > Thanks,
> > Hyunsik
> >
> > PS: Since the initial discussion, the main change is that I've added 4
> > new committers. I've also revised parts of the Known Risks description
> > because the group of initial committers has become diverse.
> >
> > ----------------
> > Tajo Proposal
> >
> > = Abstract =
> >
> > Tajo is a distributed data warehouse system for Hadoop.
> >
> >
> > = Proposal =
> >
> > Tajo is a relational and distributed data warehouse system for Hadoop.
> > Tajo is designed for low-latency and scalable ad-hoc queries, online
> > aggregation, and ETL on large data sets by leveraging advanced database
> > techniques. It supports SQL standards. Tajo is inspired by Dryad,
> > MapReduce, Dremel, Scope, and parallel databases. Tajo uses HDFS as a
> > primary storage layer, and it has its own query engine, which allows
> > direct control of distributed execution and data flow. As a result, Tajo
> > has a variety of query evaluation strategies and more optimization
> > opportunities. In addition, Tajo will have a native columnar execution
> > engine and its own optimizer. Tajo will be an alternative to Hive/Pig
> > running on top of MapReduce.
> >
> >
> > = Background =
> >
> > Big data analysis has gained much attention in industry. Open source
> > communities have proposed scalable and distributed solutions for ad-hoc
> > queries on big data. However, there is still room for improvement.
> > Markets need faster and more efficient solutions. Recently, some
> > alternatives (e.g., Cloudera's Impala and Amazon Redshift) have come out.
> >
> >
> > = Rationale =
> >
> > There are a variety of open source distributed execution engines (e.g.,
> > Hive and Pig) running on top of MapReduce. They are limited by the MR
> > framework: because they simply run on MR, they cannot directly control
> > distributed execution and data flow. As a result, they have limited
> > query evaluation strategies and optimization opportunities, and it is
> > hard to optimize them for a particular type of data processing.
> >
> >
> > = Initial Goals =
> >
> > The initial goal is to write more documentation describing Tajo's
> > internals. It will help to recruit more committers and to build a solid
> > community. Then, we will set milestones for short- and long-term plans.
> >
> >
> > = Current Status =
> >
> > Tajo is in the alpha stage. Users can execute common SQL queries (e.g.,
> > selection, projection, group-by, join, union, and sort), with the
> > exception of nested queries. Tajo provides various row/column storage
> > formats, such as CSV, RowFile (a row-store file we have implemented),
> > RCFile, and Trevni, and it also has a rudimentary ETL feature to
> > transform one data format into another. In addition, Tajo provides hash
> > and range repartitioning. By using both repartitioning methods, Tajo
> > processes aggregation, join, and sort queries over a number of cluster
> > nodes. To evaluate the performance, we have carried out a benchmark
> > test using TPC-H 1TB on 32 cluster nodes.
> >
> >
> > == Meritocracy ==
> >
> > We will discuss the milestones and future plans in an open forum. We
> > plan to encourage an environment that supports a meritocracy.
> > Contributors will have different privileges according to their
> > contributions.
> >
> >
> > == Community ==
> >
> > Big data analysis has gained attention from open source communities,
> > industry, and academia. Some projects related to Hadoop already have
> > very large and active communities. We expect that Tajo will also
> > establish an active community. Since Tajo already works for some
> > features and is in the alpha stage, we expect it to attract a large
> > community soon.
> >
> >
> > == Core Developers ==
> >
> > Core developers are a diverse group of developers, many of whom are very
> > experienced in open source and the Apache Hadoop ecosystem.
> >
> > * Eli Reisman <ereisman AT apache DOT org>
> >
> > * Henry Saputra <hsaputra AT apache DOT org>
> >
> > * Hyunsik Choi <hyunsik AT apache DOT org>
> >
> > * Jae Hwa Jung <jhjung AT gruter DOT com>
> >
> > * Jihoon Son <ghoonson AT gmail DOT com>
> >
> > * Jin Ho Kim <jhkim AT gruter DOT com>
> >
> > * Roshan Sumbaly <rsumbaly AT gmail DOT com>
> >
> > * Sangwook Kim <swkim AT inervit DOT com>
> >
> > * Yi A Liu <yi DOT a DOT liu AT intel DOT com>
> >
> >
> > == Alignment ==
> >
> > Tajo employs Apache Hadoop YARN as a resource management platform for
> > large clusters. It uses HDFS as a primary storage layer. It already
> > supports Hadoop-related data formats (RCFile, Trevni) and will support
> > the ORC file format. In addition, we plan to integrate Tajo with other
> > products of the Hadoop ecosystem. Tajo's modules are well organized, and
> > these modules can also be used for other projects.
> >
> >
> > = Known Risks =
> >
> > == Orphaned Products ==
> >
> > Most of the code has been developed by only two core developers, Hyunsik
> > Choi and Jihoon Son, so there may be a risk of the project being
> > orphaned. However, they are guaranteed to have enough time to develop
> > Tajo for years. As you can see from the commit history, they have
> > participated in this project for about two years. In addition, the
> > initial committers are diverse, and Tajo has been supported by two IT
> > companies in South Korea. So, the risk of being orphaned is very low.
> > Later, we will be eager to recruit additional committers in order to
> > eliminate this risk.
> >
> >
> > == Inexperience with Open Source ==
> >
> > Most of the initial committers have experience working on open source
> > projects. In particular, Eli, Henry, and Hyunsik have experience as
> > committers and PMC members on other Apache projects.
> >
> >
> > == Homogeneous Developers ==
> >
> > Although they are a diverse group of developers, the fact that half of
> > the core developers are in South Korea may be a risk, because their
> > offline activities are limited by their location. Since we recognize
> > this risk, we will write more complete documents and presentation
> > materials in order to disseminate Tajo's internals and user guide. In
> > addition, to mitigate this risk we will be eager to recruit additional
> > committers around the world.
> >
> >
> > == Reliance on Salaried Developers ==
> >
> > It is expected that Tajo development will occur on both salaried time and
> > volunteer time. Hyunsik and Jihoon belong to the Database Lab., Korea
> > Univ. They will be paid by the lab to contribute to Tajo for years. Jin
> > Ho and Sangwook are paid by their employers to contribute to this
> > project. Other developers will contribute to this project on volunteer
> > time. In addition, we will be eager to recruit additional committers,
> > including both salaried and non-salaried developers.
> >
> >
> > == Relationships with Other Apache Products ==
> >
> > Tajo has some overlapping functionality with Apache Incubator Drill.
> > However, Tajo is more mature than Drill. In addition, there are some
> > significant differences. Drill is a distributed system specialized for
> > low-latency query processing by using column operations and intermediate
> > data streaming. Drill has a very simple query optimizer. However, some
> > queries, including big-big table joins and sorts, cannot be processed in
> > that manner. Drill will support only some query types.
> >
> > In contrast, Tajo has an advanced query optimization system. Tajo mainly
> > aims at scalable and efficient processing of all query types. By using
> > the query optimizer, Tajo will pursue low-latency query processing only
> > for the query types that can be executed in an online aggregation manner.
> >
> > Besides, Tez has some overlapping functions with Tajo. However, Tez is in
> > the pre-alpha stage and may be a prototype. When Tez becomes feasible,
> > Tajo could use Tez as an underlying framework where applicable.
> > However, Tajo will still use its row/native columnar execution engine and
> > its optimizer. Tajo may potentially be the first application of Tez.
> >
> >
> > == An Excessive Fascination with the Apache Brand ==
> >
> > We believe that the Apache brand will help us to find contributors and to
> > grow the community. The community and development process will make this
> > project more stable and help establish ubiquitous APIs. In addition, Tajo
> > depends on other projects in the Apache Hadoop ecosystem. We expect
> > cooperative work with those projects to happen in the same place.
> >
> >
> > = Documentation =
> >
> > Tajo's demonstration paper was accepted to IEEE ICDE 2013. Since this
> > conference will be held in April 2013, we cannot publicly show the paper
> > yet. Instead, we have attached some presentation material. Check out
> > these slides: http://www.slideshare.net/hyunsikchoi/tajo-intro
> >
> > In addition, some documents (e.g., getting started) are available at
> > http://tajo-project.github.com/tajo/.
> >
> >
> > = Initial Source =
> >
> > The initial source code has been developed in the Database Lab., Korea
> > Univ. It is implemented in Java and has almost 100,000 lines, excluding
> > the parser and protobuf-generated code. The initial source code is
> > already available on GitHub at [[https://github.com/tajo-project/tajo]].
> >
> >
> > = Source and Intellectual Property Submission Plan =
> >
> > We intend the entire code base to be licensed under the Apache License,
> > Version 2.0.
> >
> >
> > = External Dependencies =
> >
> > All required dependencies have Apache-compatible licenses. The
> > following components have non-Apache licenses:
> >
> > * Google Guava
> >
> > * Google Protocol Buffer
> >
> > * Antlr
> >
> > * Mockito
> >
> > * JLine2
> >
> >
> > = Cryptography =
> >
> > Tajo will depend on secure Hadoop that can optionally use Kerberos.
> >
> >
> > = Required Resources =
> >
> > == Mailing Lists ==
> >
> > * tajo-private (with moderated subscriptions)
> >
> > * tajo-dev
> >
> > * tajo-commits
> >
> >
> > == Subversion Directory ==
> >
> > https://git-wip-us.apache.org/repos/asf/tajo.git
> >
> >
> > == Issue Tracking ==
> >
> > Jira Tajo (TAJO)
> >
> >
> > == Other Resources ==
> >
> > * Continuous Integration
> >
> >   * Jenkins
> >
> > * Wiki
> >
> >   * http://wiki.apache.org/tajo
> >
> >
> > = Initial Committers =
> >
> > * Eli Reisman <ereisman AT apache DOT org>
> >
> > * Henry Saputra <hsaputra AT apache DOT org>
> >
> > * Hyunsik Choi <hyunsik AT apache DOT org>
> >
> > * Jae Hwa Jung <jhjung AT gruter DOT com>
> >
> > * Jihoon Son <ghoonson AT gmail DOT com>
> >
> > * Jin Ho Kim <jhkim AT gruter DOT com>
> >
> > * Roshan Sumbaly <rsumbaly AT gmail DOT com>
> >
> > * Sangwook Kim <swkim AT inervit DOT com>
> >
> > * Yi A Liu <yi DOT a DOT liu AT intel DOT com>
> >
> >
> > = Affiliations =
> >
> > * Eli Reisman (Hortonworks)
> >
> > * Henry Saputra (Platfora)
> >
> > * Hyunsik Choi (Database Lab., Korea University)
> >
> > * Jae Hwa Jung (Gruter)
> >
> > * Jihoon Son (Database Lab., Korea University)
> >
> > * Jin Ho Kim (Gruter)
> >
> > * Roshan Sumbaly (LinkedIn)
> >
> > * Sangwook Kim (Inervit)
> >
> > * Yi A Liu (Intel)
> >
> >
> > The nominated mentors are employees of NASA JPL, LinkedIn, and
> > Hortonworks.
> >
> > * Chris Mattmann - NASA JPL
> >
> > * Jakob Homan - LinkedIn
> >
> > * Owen O'Malley - Hortonworks
> >
> >
> > = Sponsors =
> >
> > == Champion ==
> >
> > * Jakob Homan <jghoman AT apache DOT org>
> >
> >
> > == Nominated Mentors ==
> >
> > * Chris Mattmann <chris DOT a DOT mattmann AT jpl DOT nasa DOT gov>
> >
> > * Jakob Homan <jghoman AT apache DOT org>
> >
> > * Owen O'Malley <omalley AT apache DOT org>
> >
> >
> > == Sponsoring Entity ==
> >
> > Apache Incubator
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>
