incubator-general mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: [PROPOSAL] Tez to join Apache Incubator
Date Tue, 19 Feb 2013 05:40:49 GMT
This seems like a reasonable project (basically it is the long-fabled
map-reduce-reduce, or MCR* in Google terminology).
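
To make the map-reduce-reduce point concrete, here is a toy sketch of why a plan with several shuffle steps turns into several MapReduce jobs but could be one DAG job. The `Stage` type, the function names and the example plan are all made up for illustration; this is a simplified cost model, not how Pig, Hive or Tez actually plan queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    needs_shuffle: bool  # True if the stage must repartition/sort its input


def mapreduce_jobs(plan):
    # Toy model: a MapReduce job contains at most one shuffle (map -> reduce),
    # so every shuffle boundary in the plan costs one job. A plan with no
    # shuffle at all still needs a single map-only job.
    return max(1, sum(1 for stage in plan if stage.needs_shuffle))


def dag_jobs(plan):
    # Assumption: a general task DAG can hold every stage of the plan in a
    # single application, whatever the number of shuffle boundaries.
    return 1


# Hypothetical query plan: scan, then three stages that each need a shuffle.
plan = [
    Stage("scan", needs_shuffle=False),
    Stage("join", needs_shuffle=True),
    Stage("group_by", needs_shuffle=True),
    Stage("order_by", needs_shuffle=True),
]

print("MapReduce jobs:", mapreduce_jobs(plan))
print("DAG jobs:", dag_jobs(plan))
```

Under this model the example plan needs three MapReduce jobs, each with its own intermediate output, against one DAG job.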

But it is *very* heavy with Hortonworks developers.  By my count, the
proportion is over half from HW with only token representation from other
companies:

  13 Hortonworks
   4 Yahoo
   3 Facebook
   2 Microsoft
   1 Cloudera

Shouldn't this be a bit broader to start with?  Or is that an incubation
task?
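
For anyone who wants to check my count, here is a minimal sketch that redoes the tally. The mapping is transcribed by hand from the Affiliations section of the quoted proposal; the variable names are my own.

```python
from collections import Counter

# Affiliations as listed in the "Affiliations" section of the quoted proposal.
affiliations = {
    "Alan Gates": "Hortonworks",
    "Arun C Murthy": "Hortonworks",
    "Ashutosh Chauhan": "Hortonworks",
    "Bikas Saha": "Hortonworks",
    "Chris Douglas": "Microsoft",
    "Daryn Sharp": "Yahoo",
    "Devaraj Das": "Hortonworks",
    "Gopal Vijayaraghavan": "Hortonworks",
    "Gunther Hagleitner": "Hortonworks",
    "Hitesh Shah": "Hortonworks",
    "Jason Lowe": "Yahoo",
    "Jean Xu": "Facebook",
    "Jitendra Pandey": "Hortonworks",
    "Kevin Wilfong": "Facebook",
    "Mike Liddell": "Microsoft",
    "Namit Jain": "Facebook",
    "Owen O'Malley": "Hortonworks",
    "Robert Evans": "Yahoo",
    "Siddharth Seth": "Hortonworks",
    "Tom White": "Cloudera",
    "Thomas Graves": "Yahoo",
    "Vikram Dixit": "Hortonworks",
    "Vinod Kumar Vavilapalli": "Hortonworks",
}

counts = Counter(affiliations.values())
total = sum(counts.values())

# Print each employer's head count and its share of the committer list.
for company, n in counts.most_common():
    print(f"{n:>4} {company} ({n / total:.0%})")
```

That gives 13 of 23 initial committers from Hortonworks, about 57%.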

On Mon, Feb 18, 2013 at 9:29 PM, Arun C Murthy <acm@hortonworks.com> wrote:

> Folks,
>
>  I'd like to propose adding Tez to the Apache Incubator:
> http://wiki.apache.org/incubator/TezProposal
>
>  Essentially, it's the next step to improve projects in the Apache Hadoop
> ecosystem such as Apache Hive, Apache Pig, Cascading (ASL2, but not an ASF
> project) by providing a more complex DAG of 'tasks' in a single application
> to process data, thereby providing significant advantages for them.
>
>  During the time I've spent working on MapReduce, I've forever heard
> complaints from Pig/Hive folks about the fact that MapReduce provides a
> very constrained task graph, which results in an excessive number of MapReduce
> jobs... *smile*. It's very exciting to take this next step, and I would be
> thrilled to have it happen in the ASF - as you can see in the proposal this
> effort has broad support from members of MapReduce, Hive & Pig communities,
> many of whom are eager to participate and have already contributed their
> efforts during the initial prototype.
>
>  I welcome your feedback/discussion and look forward to it!
>
> thanks,
> Arun
> (proposed Champion)
>
> ----
>
> = Tez =
>
> == Abstract ==
> Tez is an effort to develop a generic application framework which can be
> used to process arbitrarily complex data-processing tasks, and also a
> re-usable set of data-processing primitives which can be used by other
> projects.
>
> == Proposal ==
> Tez is a proposal to develop a generic application which can be used to
> process complex data-processing task DAGs and which runs natively on Apache
> Hadoop YARN. YARN is a generic resource-management system on which
> applications like MapReduce already run. MapReduce is a specific, and
> constrained, DAG, which is not optimal for several frameworks like Apache
> Hive and Apache Pig. Furthermore, we propose to develop a re-usable set of
> libraries of data-processing primitives such as sorting, merging,
> data-shuffling, intermediate data management etc. These are necessary for
> Tez, and we envision that other projects can use them directly.
>
> == Background ==
> Apache Hadoop MapReduce has emerged as the assembly-language on which other
> frameworks like Apache Pig and Apache Hive have been built. However, it has
> been well accepted that MapReduce produces very constrained task DAGs for
> each job, which results in Apache Pig and Apache Hive requiring multiple
> MapReduce jobs for several queries. By providing a more expressive DAG of
> tasks for a job, Tez attempts to provide significantly enhanced
> data-processing capabilities for projects like Apache Pig, Apache Hive,
> Cascading etc.
>
> == Rationale ==
> Tez fills an important gap in the Apache Hadoop ecosystem by allowing more
> expressive task DAGs for data-processing applications such as Apache Pig,
> Apache Hive, Cascading etc.
>
> With the emergence of Apache Hadoop YARN, there is a strong need for a
> common DAG application which can then be shared by Apache Pig, Apache Hive,
> Cascading etc.
>
> == Initial Goals ==
> The initial goals for this project are to specify the detailed requirements
> and architecture, and then develop the initial implementation including the
> DAG ApplicationMaster to run natively inside Apache Hadoop YARN.
>
> == Current Status ==
> Significant work has been completed to identify the initial requirements
> and define the overall system architecture. There is a patch available in
> the internal Hortonworks git repository which can act as the initial seed.
>
> === Meritocracy ===
> We plan to invest in supporting a meritocracy. We will discuss the
> requirements in an open forum. Several companies have already expressed
> interest in this project, and we intend to invite additional developers to
> participate. We will encourage and monitor community participation so that
> privileges can be extended to those who contribute.
>
> === Community ===
> The need for a generic DAG application for data processing in open source
> is tremendous, so there is potential for a very large community. We believe
> that Tez's extensible architecture will further encourage community
> participation. Also, related Apache projects (e.g., Pig, Hive) have very
> large and active communities, and we expect that over time Tez will also
> attract a large community.
>
> === Core Developers ===
> The developers on the initial committers list include people very
> experienced in the Apache Hadoop ecosystem:
>
>  * Alan Gates <gates at apache dot org>
>  * Arun C Murthy <acmurthy at apache dot org>
>  * Ashutosh Chauhan <hashutosh at apache dot org>
>  * Bikas Saha <bikas at apache dot org>
>  * Chris Douglas <cdouglas at apache dot org>
>  * Daryn Sharp <daryn at apache dot org>
>  * Devaraj Das <ddas at apache dot org>
>  * Gopal Vijayaraghavan <gopal at hortonworks dot com>
>  * Gunther Hagleitner <ghagleitner at hortonworks dot com>
>  * Hitesh Shah <hitesh at apache dot org>
>  * Jason Lowe <jlowe at apache dot org>
>  * Jean Xu <jeanxu at facebook dot com>
>  * Jitendra Pandey <jitendra at apache dot org>
>  * Kevin Wilfong <kevinwilfong at apache dot org>
>  * Mike Liddell <mike dot lidell at microsoft dot com>
>  * Namit Jain <namit at apache dot org>
>  * Owen O'Malley <omalley at apache dot org>
>  * Robert Evans <bobby at apache dot org>
>  * Siddharth Seth <sseth at apache dot org>
>  * Tom White <tomwhite at apache dot org>
>  * Thomas Graves <tgraves at apache dot org>
>  * Vikram Dixit <vikram at apache dot org>
>  * Vinod Kumar Vavilapalli <vinodkv at apache dot org>
>
> We realize that though we have significant employer diversity already,
> additional diversity is always better, and we will work
> aggressively to recruit developers from additional companies.
>
> === Alignment ===
> The initial committers strongly believe that a standard task DAG
> application on Apache Hadoop YARN will gain broader adoption as an
> open-source, community-driven project, where the community can contribute
> not only to the core components, but also to a growing collection of
> applications which will be built on top of Tez. Our hope is that the Apache
> Hive, Apache Pig, Cascading and other communities will find tremendous
> value in Tez and will adopt it en masse.
>
> == Known Risks ==
>
> === Orphaned Products ===
> The contributors are leading users and vendors in the Apache Hadoop
> ecosystem, with significant open source experience, so the risk of being
> orphaned is relatively low. The project could be at risk if vendors decided
> to change their strategies in the market. In such an event, the current
> committers plan to continue working on the project on their own time,
> though the progress will likely be slower. We plan to mitigate this risk by
> recruiting additional committers.
>
> === Inexperience with Open Source ===
> The initial committers include veteran Apache members (Committers, PMC
> members and Apache Members) and other developers who have varying degrees
> of experience with open source projects. All have been involved with source
> code that has been released under an open source license, and several also
> have experience developing code with an open source development process.
>
> === Homogeneous Developers ===
> The initial committers are employed by a number of companies, including
> Cloudera, Facebook, Hortonworks, Microsoft and Yahoo. We are committed to
> recruiting additional committers from other companies based on their
> contributions to the project even though we do have significant diversity
> already.
>
> === Reliance on Salaried Developers ===
> It is expected that Tez development will occur on both salaried time and on
> volunteer time, after hours. The majority of initial committers are paid by
> their employer to contribute to this project. However, they are all
> passionate about the project, and we are confident that the project will
> continue even if no salaried developers contribute to the project. We are
> committed to recruiting additional committers including non-salaried
> developers.
>
> === Relationships with Other Apache Products ===
> As mentioned in the Alignment section, Tez is closely integrated with
> Hadoop, Hive and Pig in numerous ways. We look forward to collaborating
> with those communities, as well as other Apache communities.
>
> === An Excessive Fascination with the Apache Brand ===
> Tez solves a real need for generic task DAG management in the Apache Hadoop
> ecosystem, something which has been addressed in a very ad hoc manner so
> far by multiple Apache projects. Our rationale for developing Tez as an
> Apache project is detailed in the Rationale section. We believe that the
> Apache brand and community process will help us attract more contributors
> to this project, and help establish ubiquitous APIs.
>
> == Documentation ==
> http://wiki.apache.org/incubator/TezProposal
>
> == Initial Source ==
> Available as a patch.
>
> == Cryptography ==
> Tez will eventually support encryption on the wire. This is not one of the
> initial goals, and we do not expect Tez to be a controlled export item due
> to the use of encryption.
>
> == Required Resources ==
>
> === Mailing List ===
>  * tez-private
>  * tez-dev
>  * tez-user
>
> === Subversion Directory ===
> Git is the preferred source control system: git://git.apache.org/tez
>
> === Issue Tracking ===
>
> JIRA Tez (TEZ)
>
> == Initial Committers ==
>  * Alan Gates <gates at apache dot org>
>  * Arun C Murthy <acmurthy at apache dot org>
>  * Ashutosh Chauhan <hashutosh at apache dot org>
>  * Bikas Saha <bikas at apache dot org>
>  * Chris Douglas <cdouglas at apache dot org>
>  * Daryn Sharp <daryn at apache dot org>
>  * Devaraj Das <ddas at apache dot org>
>  * Gopal Vijayaraghavan <gopal at hortonworks dot com>
>  * Gunther Hagleitner <ghagleitner at hortonworks dot com>
>  * Hitesh Shah <hitesh at apache dot org>
>  * Jason Lowe <jlowe at apache dot org>
>  * Jean Xu <jeanxu at facebook dot com>
>  * Jitendra Pandey <jitendra at apache dot org>
>  * Kevin Wilfong <kevinwilfong at apache dot org>
>  * Mike Liddell <mike dot lidell at microsoft dot com>
>  * Namit Jain <namit at apache dot org>
>  * Owen O'Malley <omalley at apache dot org>
>  * Robert Evans <bobby at apache dot org>
>  * Siddharth Seth <sseth at apache dot org>
>  * Tom White <tomwhite at apache dot org>
>  * Thomas Graves <tgraves at apache dot org>
>  * Vikram Dixit <vikram at apache dot org>
>  * Vinod Kumar Vavilapalli <vinodkv at apache dot org>
>
> == Affiliations ==
> The initial committers are employees of Cloudera, Facebook, Hortonworks,
> Microsoft and Yahoo Inc.
>
>  * Alan Gates - Hortonworks
>  * Arun C Murthy - Hortonworks
>  * Ashutosh Chauhan - Hortonworks
>  * Bikas Saha - Hortonworks
>  * Chris Douglas - Microsoft
>  * Daryn Sharp - Yahoo
>  * Devaraj Das - Hortonworks
>  * Gopal Vijayaraghavan - Hortonworks
>  * Gunther Hagleitner - Hortonworks
>  * Hitesh Shah - Hortonworks
>  * Jason Lowe - Yahoo
>  * Jean Xu - Facebook
>  * Jitendra Pandey - Hortonworks
>  * Kevin Wilfong - Facebook
>  * Mike Liddell - Microsoft
>  * Namit Jain - Facebook
>  * Owen O'Malley - Hortonworks
>  * Robert Evans - Yahoo
>  * Siddharth Seth - Hortonworks
>  * Tom White - Cloudera
>  * Thomas Graves - Yahoo
>  * Vikram Dixit - Hortonworks
>  * Vinod Kumar Vavilapalli - Hortonworks
>
> The nominated mentors are employees of Hortonworks,
> NASA JPL and Microsoft.
>
>  * Alan Gates - Hortonworks
>  * Arun C Murthy - Hortonworks
>  * Chris Douglas - Microsoft
>  * Chris Mattmann - NASA JPL
>  * Owen O'Malley - Hortonworks
>
> == Sponsors ==
>
> === Champion ===
> Arun C Murthy <acmurthy at apache dot org>
>
> === Nominated Mentors ===
>  * Alan Gates <gates at apache dot org> - Architect at Hortonworks.
> Committer for Pig.
>  * Arun C Murthy <acmurthy at apache dot org> - Architect at
> Hortonworks. Committer for Hadoop.
>  * Chris Douglas <cdouglas at apache dot org> - Sr. Research Engineer at
> Microsoft. Committer for Hadoop.
>  * Chris Mattmann <mattmann at apache dot org> - Sr. Computer Scientist,
> NASA JPL. Committer for Nutch, OODT and Tika.
>  * Owen O'Malley <omalley at apache dot org> - Architect at Hortonworks.
> Committer for Hadoop, Ambari.
>
> === Sponsoring Entity ===
> Incubator
>
