incubator-general mailing list archives

From Avik Dey <aviks.em...@gmail.com>
Subject Re: [PROPOSAL] Tez to join Apache Incubator
Date Tue, 19 Feb 2013 08:53:33 GMT
The Tez incubator proposal seems to have a lot in common with the work on
https://issues.apache.org/jira/browse/OOZIE-1178

> It is useful to have a workflow application master, which will be capable
> of running a DAG of jobs. The workflow client submits a DAG request to the
> AM and then the AM will manage the life cycle of this application in terms
> of requesting the needed resources from the RM, and starting, monitoring
> and retrying the application's individual tasks.
>
> Compared to running Oozie with the current MapReduce Application Master,
> these are some of the advantages:
>
>    - Fewer resources consumed, since only one application master
>    will be spawned for the whole workflow.
>    - Reuse of resources, since the same resources can be used by multiple
>    consecutive jobs in the workflow (no need to request/wait for resources for
>    every individual job from the central RM).
>    - More optimization opportunities in terms of collective resource
>    requests.
>    - Optimization opportunities in terms of rewriting and composing jobs
>    in the workflow (e.g. pushing down Mappers).
>    - This Application Master can be reused/extended by higher-level systems
>    like Pig and Hive to provide an optimized way of running their workflows.
>
> So, is this the 'yapp' proposal that was discussed on that thread?
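
To make the "one application master for the whole workflow" idea concrete, here is a minimal sketch of a single coordinating loop that runs a DAG of jobs in dependency order. It uses only Python's standard `graphlib` module; it is not the Oozie or Tez API. The job names and the `run_workflow` helper are hypothetical, and resource requests, monitoring and retries are left out.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a job, each value is the set of jobs
# it depends on. The names are illustrative only.
workflow = {
    "extract": set(),
    "clean": {"extract"},
    "join": {"clean"},
    "aggregate": {"clean"},
    "report": {"join", "aggregate"},
}


def run_workflow(dag, run_job):
    """Run every job in dependency order from one coordinating loop.

    Returns the list of "waves": groups of jobs whose dependencies were
    all satisfied at the same point. Jobs within a wave could in principle
    share or reuse the same resources. Raises graphlib.CycleError if the
    graph is not a DAG.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        for job in ready:
            run_job(job)      # a real AM would launch and monitor tasks here
            sorter.done(job)  # unblocks jobs that depended on this one
        waves.append(ready)
    return waves


if __name__ == "__main__":
    for wave in run_workflow(workflow, lambda job: None):
        print(wave)
```

In this sketch the whole workflow is handled by one loop, which is the analogue of a single AM owning the DAG instead of one AM per MapReduce job.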

~avik


On Mon, Feb 18, 2013 at 9:40 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> This seems like a reasonable project (basically it is the long fabled
> map-reduce-reduce or MCR* in google terminology).
>
> But it is *very* heavy with Hortonworks developers.  By my count, the
> proportion is over half from HW with only token representation from other
> companies:
>
>   13 Hortonworks
>    4 Yahoo
>    3 Facebook
>    2 Microsoft
>    1 Cloudera
>
> Shouldn't this be a bit broader to start with?  Or is that an incubation
> task?
>
> On Mon, Feb 18, 2013 at 9:29 PM, Arun C Murthy <acm@hortonworks.com>
> wrote:
>
> > Folks,
> >
> >  I'd like to propose adding Tez to the Apache Incubator:
> > http://wiki.apache.org/incubator/TezProposal
> >
> >  Essentially, it's the next step to improve projects in the Apache Hadoop
> > ecosystem such as Apache Hive, Apache Pig, Cascading (ASL2, but not an ASF
> > project) by providing a more complex DAG of 'tasks' in a single application
> > to process data, thereby providing significant advantages for them.
> >
> >  During the time I've spent working on MapReduce, I've forever heard
> > complaints from Pig/Hive folks about the fact that MapReduce provides a
> > very constrained task graph which results in an excessive number of
> > MapReduce jobs... *smile*. It's very exciting to take this next step, and
> > I would be thrilled to have it happen in the ASF - as you can see in the
> > proposal this effort has broad support from members of the MapReduce,
> > Hive & Pig communities, many of whom are eager to participate and have
> > already contributed their efforts during the initial prototype.
> >
> >  I welcome your feedback/discussion and look forward to it!
> >
> > thanks,
> > Arun
> > (proposed Champion)
> >
> > ----
> >
> > = Tez =
> >
> > == Abstract ==
> > Tez is an effort to develop a generic application framework which can be
> > used to process arbitrarily complex data-processing tasks and also a
> > re-usable set of data-processing primitives which can be used by other
> > projects.
> >
> > == Proposal ==
> > Tez is a proposal to develop a generic application which can be used to
> > process complex data-processing task DAGs and runs natively on Apache
> > Hadoop YARN. YARN is a generic resource-management system on which
> > applications like MapReduce already run. MapReduce is a specific, and
> > constrained, DAG - which is not optimal for several frameworks like
> > Apache Hive and Apache Pig. Furthermore, we propose to develop a re-usable
> > set of libraries of data-processing primitives such as sorting, merging,
> > data-shuffling, intermediate data management etc. which are necessary for
> > Tez and which we envision can be used directly by other projects.
> >
> > == Background ==
> > Apache Hadoop MapReduce has emerged as the assembly-language on which
> > other frameworks like Apache Pig and Apache Hive have been built. However,
> > it has been well accepted that MapReduce produces very constrained task
> > DAGs for each job which results in Apache Pig and Apache Hive requiring
> > multiple MapReduce jobs for several queries. By providing a more
> > expressive DAG of tasks for a job, Tez attempts to provide significantly
> > enhanced data-processing capabilities for projects like Apache Pig,
> > Apache Hive, Cascading etc.
> >
> > == Rationale ==
> > There is an important gap that Tez fills in the Apache Hadoop ecosystem
> > by allowing for more expressive task DAGs for data-processing applications
> > such as Apache Pig, Apache Hive, Cascading etc.
> >
> > With the emergence of Apache Hadoop YARN, there is a strong need for a
> > common DAG application which can then be shared by Apache Pig, Apache
> > Hive, Cascading etc.
> >
> > == Initial Goals ==
> > The initial goals for this project are to specify the detailed
> > requirements and architecture, and then develop the initial implementation
> > including the DAG ApplicationMaster to run natively inside Apache Hadoop
> > YARN.
> >
> > == Current Status ==
> > Significant work has been completed to identify the initial requirements
> > and define the overall system architecture. There is a patch available in
> > the internal Hortonworks git repository which can act as the initial seed.
> >
> > === Meritocracy ===
> > We plan to invest in supporting a meritocracy. We will discuss the
> > requirements in an open forum. Several companies have already expressed
> > interest in this project, and we intend to invite additional developers to
> > participate. We will encourage and monitor community participation so that
> > privileges can be extended to those that contribute.
> >
> > === Community ===
> > The need for a generic DAG application for data processing in open source
> > is tremendous, so there is potential for a very large community. We
> > believe that Tez's extensible architecture will further encourage
> > community participation. Also, related Apache projects (e.g., Pig, Hive)
> > have very large and active communities, and we expect that over time Tez
> > will also attract a large community.
> >
> > === Core Developers ===
> > The developers on the initial committers list include people very
> > experienced in the Apache Hadoop ecosystem:
> >
> >  * Alan Gates <gates at apache dot org>
> >  * Arun C Murthy <acmurthy at apache dot org>
> >  * Ashutosh Chauhan <hashutosh at apache dot org>
> >  * Bikas Saha <bikas at apache dot org>
> >  * Chris Douglas <cdouglas at apache dot org>
> >  * Daryn Sharp <daryn at apache dot org>
> >  * Devaraj Das <ddas at apache dot org>
> >  * Gopal Vijayaraghavan <gopal at hortonworks dot com>
> >  * Gunther Hagleitner <ghagleitner at hortonworks dot com>
> >  * Hitesh Shah <hitesh at apache dot org>
> >  * Jason Lowe <jlowe at apache dot org>
> >  * Jean Xu <jeanxu at facebook dot com>
> >  * Jitendra Pandey <jitendra at apache dot org>
> >  * Kevin Wilfong <kevinwilfong at apache dot org>
> >  * Mike Liddell <mike dot lidell at microsoft dot com>
> >  * Namit Jain <namit at apache dot org>
> >  * Owen O'Malley <omalley at apache dot org>
> >  * Robert Evans <bobby at apache dot org>
> >  * Siddharth Seth <sseth at apache dot org>
> >  * Tom White <tomwhite at apache dot org>
> >  * Thomas Graves <tgraves at apache dot org>
> >  * Vikram Dixit <vikram at apache dot org>
> >  * Vinod Kumar Vavilapalli <vinodkv at apache dot org>
> >
> > We realize that though we have significant employer diversity already,
> > additional diversity is always better, and we will work
> > aggressively to recruit developers from additional companies.
> >
> > === Alignment ===
> > The initial committers strongly believe that a standard task DAG
> > application on Apache Hadoop YARN will gain broader adoption as an open
> > source, community driven project, where the community can contribute not
> > only to the core components, but also to a growing collection of
> > applications which will be based on top of Tez. Our hope is that the
> > Apache Hive, Apache Pig, Cascading and other communities will find
> > tremendous value in Tez and will adopt it en masse.
> >
> > == Known Risks ==
> >
> > === Orphaned Products ===
> > The contributors are leading users and vendors in the Apache Hadoop
> > ecosystem, with significant open source experience, so the risk of being
> > orphaned is relatively low. The project could be at risk if vendors
> > decided to change their strategies in the market. In such an event, the
> > current committers plan to continue working on the project on their own
> > time, though the progress will likely be slower. We plan to mitigate this
> > risk by recruiting additional committers.
> >
> > === Inexperience with Open Source ===
> > The initial committers include veteran Apache members (Committers, PMC
> > members and Apache Members) and other developers who have varying degrees
> > of experience with open source projects. All have been involved with
> > source code that has been released under an open source license, and
> > several also have experience developing code with an open source
> > development process.
> >
> > === Homogenous Developers ===
> > The initial committers are employed by a number of companies, including
> > Cloudera, Facebook, Hortonworks, Microsoft and Yahoo. We are committed to
> > recruiting additional committers from other companies based on their
> > contributions to the project even though we do have significant diversity
> > already.
> >
> > === Reliance on Salaried Developers ===
> > It is expected that Tez development will occur on both salaried time and
> > on volunteer time, after hours. The majority of initial committers are
> > paid by their employer to contribute to this project. However, they are
> > all passionate about the project, and we are confident that the project
> > will continue even if no salaried developers contribute to the project.
> > We are committed to recruiting additional committers including
> > non-salaried developers.
> >
> > === Relationships with Other Apache Products ===
> > As mentioned in the Alignment section, Tez is closely integrated with
> > Hadoop, Hive and Pig in numerous ways. We look forward to collaborating
> > with those communities, as well as other Apache communities.
> >
> > === An Excessive Fascination with the Apache Brand ===
> > Tez solves a real need for generic task DAG management in the Apache
> > Hadoop ecosystem, something which has been addressed in a very ad hoc
> > manner so far by multiple Apache projects. Our rationale for developing
> > Tez as an Apache project is detailed in the Rationale section. We believe
> > that the Apache brand and community process will help us attract more
> > contributors to this project, and help establish ubiquitous APIs.
> >
> > == Documentation ==
> > http://wiki.apache.org/incubator/TezProposal
> >
> > == Initial Source ==
> > Available as a patch.
> >
> > == Cryptography ==
> > Tez will eventually support encryption on the wire. This is not one of
> > the initial goals, and we do not expect Tez to be a controlled export item
> > due to the use of encryption.
> >
> > == Required Resources ==
> >
> > === Mailing List ===
> >  * tez-private
> >  * tez-dev
> >  * tez-user
> >
> > === Subversion Directory ===
> > Git is the preferred source control system: git://git.apache.org/tez
> >
> > === Issue Tracking ===
> >
> > JIRA Tez (TEZ)
> >
> > == Initial Committers ==
> >  * Alan Gates <gates at apache dot org>
> >  * Arun C Murthy <acmurthy at apache dot org>
> >  * Ashutosh Chauhan <hashutosh at apache dot org>
> >  * Bikas Saha <bikas at apache dot org>
> >  * Chris Douglas <cdouglas at apache dot org>
> >  * Daryn Sharp <daryn at apache dot org>
> >  * Devaraj Das <ddas at apache dot org>
> >  * Gopal Vijayaraghavan <gopal at hortonworks dot com>
> >  * Gunther Hagleitner <ghagleitner at hortonworks dot com>
> >  * Hitesh Shah <hitesh at apache dot org>
> >  * Jason Lowe <jlowe at apache dot org>
> >  * Jean Xu <jeanxu at facebook dot com>
> >  * Jitendra Pandey <jitendra at apache dot org>
> >  * Kevin Wilfong <kevinwilfong at apache dot org>
> >  * Mike Liddell <mike dot lidell at microsoft dot com>
> >  * Namit Jain <namit at apache dot org>
> >  * Owen O'Malley <omalley at apache dot org>
> >  * Robert Evans <bobby at apache dot org>
> >  * Siddharth Seth <sseth at apache dot org>
> >  * Tom White <tomwhite at apache dot org>
> >  * Thomas Graves <tgraves at apache dot org>
> >  * Vikram Dixit <vikram at apache dot org>
> >  * Vinod Kumar Vavilapalli <vinodkv at apache dot org>
> >
> > == Affiliations ==
> > The initial committers are employees of Cloudera, Facebook, Hortonworks,
> > Microsoft  and Yahoo Inc.
> >
> >  * Alan Gates - Hortonworks
> >  * Arun C Murthy - Hortonworks
> >  * Ashutosh Chauhan - Hortonworks
> >  * Bikas Saha - Hortonworks
> >  * Chris Douglas - Microsoft
> >  * Daryn Sharp - Yahoo
> >  * Devaraj Das - Hortonworks
> >  * Gopal Vijayaraghavan - Hortonworks
> >  * Gunther Hagleitner - Hortonworks
> >  * Hitesh Shah - Hortonworks
> >  * Jason Lowe - Yahoo
> >  * Jean Xu - Facebook
> >  * Jitendra Pandey - Hortonworks
> >  * Kevin Wilfong - Facebook
> >  * Mike Liddell - Microsoft
> >  * Namit Jain - Facebook
> >  * Owen O'Malley - Hortonworks
> >  * Robert Evans - Yahoo
> >  * Siddharth Seth - Hortonworks
> >  * Tom White - Cloudera
> >  * Thomas Graves - Yahoo
> >  * Vikram Dixit - Hortonworks
> >  * Vinod Kumar Vavilapalli - Hortonworks
> >
> > The nominated mentors are employees of Hortonworks,
> > NASA JPL and Microsoft.
> >
> >  * Alan Gates - Hortonworks
> >  * Arun C Murthy - Hortonworks
> >  * Chris Douglas - Microsoft
> >  * Chris Mattmann - NASA JPL
> >  * Owen O'Malley - Hortonworks
> >
> > == Sponsors ==
> >
> > === Champion ===
> > Arun C Murthy <acmurthy at apache dot org>
> >
> > === Nominated Mentors ===
> >  * Alan Gates <gates at apache dot org> - Architect at Hortonworks.
> > Committer for Pig.
> >  * Arun C Murthy <acmurthy at apache dot org> - Architect at
> > Hortonworks. Committer for Hadoop.
> >  * Chris Douglas <cdouglas at apache dot org> - Sr. Research Engineer at
> > Microsoft. Committer for Hadoop.
> >  * Chris Mattmann <mattmann at apache dot org> - Sr. Computer Scientist,
> > NASA JPL. Committer for Nutch, OODT and Tika.
> >  * Owen O'Malley <omalley at apache dot org> - Architect at
> > Hortonworks. Committer for Hadoop, Ambari.
> >
> > === Sponsoring Entity ===
> > Incubator