incubator-general mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: [PROPOSAL] Tez to join Apache Incubator
Date Tue, 19 Feb 2013 09:23:33 GMT
Hi,

This proposal looks very interesting to me. What exactly is the scope of
Tez? Does it aim to be a general data flow system such as
Stratosphere[1] or Hyracks[2]? Or will it still be executing Map and
Reduce tasks that are composable in a more flexible manner?
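
To make the second option concrete, here is a rough sketch of Map- and
Reduce-style steps wired together as one small DAG inside a single
application. It is only an illustration under my own assumptions:
`run_dag`, the vertex names and the sample input are made up for this
mail and are not Tez APIs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(vertices, edges, source):
    """Run each vertex once, in dependency order (hypothetical helper).

    vertices: name -> function taking a list of upstream outputs
    edges:    name -> list of upstream vertex names
    source:   input handed to vertices that have no upstream vertex
    """
    graph = {name: edges.get(name, []) for name in vertices}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        upstream = [results[u] for u in edges.get(name, [])]
        results[name] = vertices[name](upstream if upstream else [source])
    return results


def tokenize(inputs):
    # Map-like step: emit (word, 1) for every word in every line.
    return [(word, 1) for line in inputs[0] for word in line.split()]


def count(inputs):
    # Reduce-like step: sum the counts per word.
    totals = {}
    for word, n in inputs[0]:
        totals[word] = totals.get(word, 0) + n
    return totals


def top_words(inputs):
    # Second reduce-like step: keep the two most frequent words.
    return sorted(inputs[0].items(), key=lambda kv: (-kv[1], kv[0]))[:2]


vertices = {"tokenize": tokenize, "count": count, "top": top_words}
edges = {"count": ["tokenize"], "top": ["count"]}
results = run_dag(vertices, edges, ["a b a", "b a c"])
print(results["top"])  # [('a', 3), ('b', 2)]
```

With plain MapReduce the second reduce-like step would presumably need
its own job; in the sketch it is just one more vertex in the same graph.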

Best,
Sebastian

[1] http://dl.acm.org/citation.cfm?id=1807148
https://www.stratosphere.eu/sites/default/files/papers/NephelePACTs_10.pdf

[2] http://dl.acm.org/citation.cfm?id=2005632
http://asterix.ics.uci.edu/pub/Hyracks.pdf

On 19.02.2013 09:53, Avik Dey wrote:
> The Tez incubator proposal seems to have a lot in common with the work on
> https://issues.apache.org/jira/browse/OOZIE-1178
> 
>> It is useful to have a workflow application master, which will be capable
>> of running a DAG of jobs. The workflow client submits a DAG request to the
>> AM and then the AM will manage the life cycle of this application in terms
>> of requesting the needed resources from the RM, and starting, monitoring
>> and retrying the application's individual tasks.
>>
>> Compared to running Oozie with the current MapReduce Application Master,
>> these are some of the advantages:
>>
>>    - Fewer consumed resources, since only one application master
>>    will be spawned for the whole workflow.
>>    - Reuse of resources, since the same resources can be used by multiple
>>    consecutive jobs in the workflow (no need to request/wait for resources for
>>    every individual job from the central RM).
>>    - More optimization opportunities in terms of collective resource
>>    requests.
>>    - Optimization opportunities in terms of rewriting and composing jobs
>>    in the workflow (e.g. pushing down Mappers).
>>    - This Application Master can be reused/extended by higher-level systems
>>    like Pig and Hive to provide an optimized way of running their workflows.
>>
>> So, is this the 'yapp' proposal that was discussed on that thread?
> 
> ~avik
> 
> 
> On Mon, Feb 18, 2013 at 9:40 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> 
>> This seems like a reasonable project (basically it is the long-fabled
>> map-reduce-reduce or MCR* in Google terminology).
>>
>> But it is *very* heavy with Hortonworks developers.  By my count, the
>> proportion is over half from HW with only token representation from other
>> companies:
>>
>>   13 Hortonworks
>>    4 Yahoo
>>    3 Facebook
>>    2 Microsoft
>>    1 Cloudera
>>
>> Shouldn't this be a bit broader to start with?  Or is that an incubation
>> task?
>>
>> On Mon, Feb 18, 2013 at 9:29 PM, Arun C Murthy <acm@hortonworks.com>
>> wrote:
>>
>>> Folks,
>>>
>>>  I'd like to propose adding Tez to the Apache Incubator:
>>> http://wiki.apache.org/incubator/TezProposal
>>>
>>>  Essentially, it's the next step to improve projects in the Apache Hadoop
>>> ecosystem such as Apache Hive, Apache Pig, Cascading (ASL2, but not an ASF
>>> project) by providing a more complex DAG of 'tasks' in a single application
>>> to process data, thereby providing significant advantages for them.
>>>
>>>  During the time I've spent working on MapReduce, I've forever heard
>>> complaints from Pig/Hive folks about the fact that MapReduce provides a
>>> very constrained task graph which results in an excessive number of
>>> MapReduce jobs... *smile*. It's very exciting to take this next step, and
>>> I would be thrilled to have it happen in the ASF - as you can see in the
>>> proposal this effort has broad support from members of the MapReduce,
>>> Hive & Pig communities, many of whom are eager to participate and have
>>> already contributed their efforts during the initial prototype.
>>>
>>>  I welcome your feedback/discussion and look forward to it!
>>>
>>> thanks,
>>> Arun
>>> (proposed Champion)
>>>
>>> ----
>>>
>>> = Tez =
>>>
>>> == Abstract ==
>>> Tez is an effort to develop a generic application framework which can be
>>> used to process arbitrarily complex data-processing tasks and also a
>>> re-usable set of data-processing primitives which can be used by other
>>> projects.
>>>
>>> == Proposal ==
>>> Tez is a proposal to develop a generic application which can be used to
>>> process complex data-processing task DAGs and runs natively on Apache
>>> Hadoop YARN. YARN is a generic resource-management system on which
>>> applications like MapReduce already exist. MapReduce is a specific, and
>>> constrained, DAG - which is not optimal for several frameworks like Apache
>>> Hive and Apache Pig. Furthermore, we propose to develop a re-usable set of
>>> libraries of data-processing primitives such as sorting, merging,
>>> data-shuffling, intermediate data management etc. which are necessary for
>>> Tez and which we envision can be used directly by other projects.
>>>
>>> == Background ==
>>> Apache Hadoop MapReduce has emerged as the assembly-language on which
>>> other frameworks like Apache Pig and Apache Hive have been built. However,
>>> it has been well accepted that MapReduce produces very constrained task
>>> DAGs for each job which results in Apache Pig and Apache Hive requiring
>>> multiple MapReduce jobs for several queries. By providing a more
>>> expressive DAG of tasks for a job, Tez attempts to provide significantly
>>> enhanced data-processing capabilities for projects like Apache Pig, Apache
>>> Hive, Cascading etc.
>>>
>>> == Rationale ==
>>> Tez fills an important gap in the Apache Hadoop ecosystem by allowing
>>> more expressive task DAGs for data-processing applications such as
>>> Apache Pig, Apache Hive, Cascading etc.
>>>
>>> With the emergence of Apache Hadoop YARN, there is a strong need for a
>>> common DAG application which can then be shared by Apache Pig, Apache
>>> Hive, Cascading etc.
>>>
>>> == Initial Goals ==
>>> The initial goals for this project are to specify the detailed
>>> requirements and architecture, and then develop the initial implementation
>>> including the DAG ApplicationMaster to run natively inside Apache Hadoop
>>> YARN.
>>>
>>> == Current Status ==
>>> Significant work has been completed to identify the initial requirements
>>> and define the overall system architecture. There is a patch available in
>>> the internal Hortonworks git repository which can act as the initial seed.
>>>
>>> === Meritocracy ===
>>> We plan to invest in supporting a meritocracy. We will discuss the
>>> requirements in an open forum. Several companies have already expressed
>>> interest in this project, and we intend to invite additional developers to
>>> participate. We will encourage and monitor community participation so that
>>> privileges can be extended to those that contribute.
>>>
>>> === Community ===
>>> The need for a generic, open source DAG application for data processing
>>> is tremendous, so there is potential for a very large community. We
>>> believe that Tez's extensible architecture will further encourage
>>> community participation. Also, related Apache projects (e.g., Pig, Hive)
>>> have very large and active communities, and we expect that over time Tez
>>> will also attract a large community.
>>>
>>> === Core Developers ===
>>> The developers on the initial committers list include people very
>>> experienced in the Apache Hadoop ecosystem:
>>>
>>>  * Alan Gates <gates at apache dot org>
>>>  * Arun C Murthy <acmurthy at apache dot org>
>>>  * Ashutosh Chauhan <hashutosh at apache dot org>
>>>  * Bikas Saha <bikas at apache dot org>
>>>  * Chris Douglas <cdouglas at apache dot org>
>>>  * Daryn Sharp <daryn at apache dot org>
>>>  * Devaraj Das <ddas at apache dot org>
>>>  * Gopal Vijayaraghavan <gopal at hortonworks dot com>
>>>  * Gunther Hagleitner <ghagleitner at hortonworks dot com>
>>>  * Hitesh Shah <hitesh at apache dot org>
>>>  * Jason Lowe <jlowe at apache dot org>
>>>  * Jean Xu <jeanxu at facebook dot com>
>>>  * Jitendra Pandey <jitendra at apache dot org>
>>>  * Kevin Wilfong <kevinwilfong at apache dot org>
>>>  * Mike Liddell <mike dot lidell at microsoft dot com>
>>>  * Namit Jain <namit at apache dot org>
>>>  * Owen O'Malley <omalley at apache dot org>
>>>  * Robert Evans <bobby at apache dot org>
>>>  * Siddharth Seth <sseth at apache dot org>
>>>  * Tom White <tomwhite at apache dot org>
>>>  * Thomas Graves <tgraves at apache dot org>
>>>  * Vikram Dixit <vikram at apache dot org>
>>>  * Vinod Kumar Vavilapalli <vinodkv at apache dot org>
>>>
>>> We realize that though we have significant employer diversity already,
>>> additional diversity is always better, and we will work
>>> aggressively to recruit developers from additional companies.
>>>
>>> === Alignment ===
>>> The initial committers strongly believe that a standard task DAG
>>> application on Apache Hadoop YARN will gain broader adoption as an open
>>> source, community driven project, where the community can contribute not
>>> only to the core components, but also to a growing collection of
>>> applications which will be based on top of Tez. Our hope is that the
>>> Apache Hive, Apache Pig, Cascading and other communities will find
>>> tremendous value in Tez and will adopt it en masse.
>>>
>>> == Known Risks ==
>>>
>>> === Orphaned Products ===
>>> The contributors are leading users and vendors in the Apache Hadoop
>>> ecosystem,
>>> with significant open source experience, so the risk of being orphaned is
>>> relatively low. The project could be at risk if vendors decided to change
>>> their strategies in the market. In such an event, the current committers
>>> plan to continue working on the project on their own time, though the
>>> progress will likely be slower. We plan to mitigate this risk by
>>> recruiting additional committers.
>>>
>>> === Inexperience with Open Source ===
>>> The initial committers include veteran Apache members (Committers, PMC
>>> members and Apache Members) and other developers who have varying degrees
>>> of experience with open source projects. All have been involved with
>>> source code that has been released under an open source license, and
>>> several also have experience developing code with an open source
>>> development process.
>>>
>>> === Homogeneous Developers ===
>>> The initial committers are employed by a number of companies, including
>>> Cloudera, Facebook, Hortonworks, Microsoft and Yahoo. We are committed to
>>> recruiting additional committers from other companies based on their
>>> contributions to the project even though we do have significant diversity
>>> already.
>>>
>>> === Reliance on Salaried Developers ===
>>> It is expected that Tez development will occur on both salaried time and
>>> on volunteer time, after hours. The majority of initial committers are
>>> paid by their employer to contribute to this project. However, they are
>>> all passionate about the project, and we are confident that the project
>>> will continue even if no salaried developers contribute to the project. We
>>> are committed to recruiting additional committers including non-salaried
>>> developers.
>>>
>>> === Relationships with Other Apache Products ===
>>> As mentioned in the Alignment section, Tez is closely integrated with
>>> Hadoop, Hive and Pig in numerous ways. We look forward to collaborating
>>> with those communities, as well as other Apache communities.
>>>
>>> === An Excessive Fascination with the Apache Brand ===
>>> Tez solves a real need for generic task DAG management in the Apache
>>> Hadoop ecosystem, something which has been addressed in a very ad hoc
>>> manner so far by multiple Apache projects. Our rationale for developing
>>> Tez as an Apache project is detailed in the Rationale section. We believe
>>> that the Apache brand and community process will help us attract more
>>> contributors to this project, and help establish ubiquitous APIs.
>>>
>>> == Documentation ==
>>> http://wiki.apache.org/incubator/TezProposal
>>>
>>> == Initial Source ==
>>> Available as a patch.
>>>
>>> == Cryptography ==
>>> Tez will eventually support encryption on the wire. This is not one of
>>> the initial goals, and we do not expect Tez to be a controlled export item
>>> due to the use of encryption.
>>>
>>> == Required Resources ==
>>>
>>> === Mailing List ===
>>>  * tez-private
>>>  * tez-dev
>>>  * tez-user
>>>
>>> === Subversion Directory ===
>>> Git is the preferred source control system: git://git.apache.org/tez
>>>
>>> === Issue Tracking ===
>>>
>>> JIRA Tez (TEZ)
>>>
>>> == Initial Committers ==
>>>  * Alan Gates <gates at apache dot org>
>>>  * Arun C Murthy <acmurthy at apache dot org>
>>>  * Ashutosh Chauhan <hashutosh at apache dot org>
>>>  * Bikas Saha <bikas at apache dot org>
>>>  * Chris Douglas <cdouglas at apache dot org>
>>>  * Daryn Sharp <daryn at apache dot org>
>>>  * Devaraj Das <ddas at apache dot org>
>>>  * Gopal Vijayaraghavan <gopal at hortonworks dot com>
>>>  * Gunther Hagleitner <ghagleitner at hortonworks dot com>
>>>  * Hitesh Shah <hitesh at apache dot org>
>>>  * Jason Lowe <jlowe at apache dot org>
>>>  * Jean Xu <jeanxu at facebook dot com>
>>>  * Jitendra Pandey <jitendra at apache dot org>
>>>  * Kevin Wilfong <kevinwilfong at apache dot org>
>>>  * Mike Liddell <mike dot lidell at microsoft dot com>
>>>  * Namit Jain <namit at apache dot org>
>>>  * Owen O'Malley <omalley at apache dot org>
>>>  * Robert Evans <bobby at apache dot org>
>>>  * Siddharth Seth <sseth at apache dot org>
>>>  * Tom White <tomwhite at apache dot org>
>>>  * Thomas Graves <tgraves at apache dot org>
>>>  * Vikram Dixit <vikram at apache dot org>
>>>  * Vinod Kumar Vavilapalli <vinodkv at apache dot org>
>>>
>>> == Affiliations ==
>>> The initial committers are employees of Cloudera, Facebook, Hortonworks,
>>> Microsoft and Yahoo Inc.
>>>
>>>  * Alan Gates - Hortonworks
>>>  * Arun C Murthy - Hortonworks
>>>  * Ashutosh Chauhan - Hortonworks
>>>  * Bikas Saha - Hortonworks
>>>  * Chris Douglas - Microsoft
>>>  * Daryn Sharp - Yahoo
>>>  * Devaraj Das - Hortonworks
>>>  * Gopal Vijayaraghavan - Hortonworks
>>>  * Gunther Hagleitner - Hortonworks
>>>  * Hitesh Shah - Hortonworks
>>>  * Jason Lowe - Yahoo
>>>  * Jean Xu - Facebook
>>>  * Jitendra Pandey - Hortonworks
>>>  * Kevin Wilfong - Facebook
>>>  * Mike Liddell - Microsoft
>>>  * Namit Jain - Facebook
>>>  * Owen O'Malley - Hortonworks
>>>  * Robert Evans - Yahoo
>>>  * Siddharth Seth - Hortonworks
>>>  * Tom White - Cloudera
>>>  * Thomas Graves - Yahoo
>>>  * Vikram Dixit - Hortonworks
>>>  * Vinod Kumar Vavilapalli - Hortonworks
>>>
>>> The nominated mentors are employees of Hortonworks,
>>> NASA JPL and Microsoft.
>>>
>>>  * Alan Gates - Hortonworks
>>>  * Arun C Murthy - Hortonworks
>>>  * Chris Douglas - Microsoft
>>>  * Chris Mattmann - NASA JPL
>>>  * Owen O'Malley - Hortonworks
>>>
>>> == Sponsors ==
>>>
>>> === Champion ===
>>> Arun C Murthy <acmurthy at apache dot org>
>>>
>>> === Nominated Mentors ===
>>>  * Alan Gates <gates at apache dot org> – Architect at Hortonworks.
>>> Committer for Pig.
>>>  * Arun C Murthy <acmurthy at apache dot org> – Architect at
>>> Hortonworks. Committer for Hadoop.
>>>  * Chris Douglas <cdouglas at apache dot org> - Sr. Research Engineer at
>>> Microsoft. Committer for Hadoop.
>>>  * Chris Mattmann <mattmann at apache dot org> - Sr. Computer Scientist,
>>> NASA JPL. Committer for Nutch, OODT and Tika.
>>>  * Owen O'Malley <omalley at apache dot org> – Architect at
>>> Hortonworks. Committer for Hadoop, Ambari.
>>>
>>> === Sponsoring Entity ===
>>> Incubator
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>> For additional commands, e-mail: general-help@incubator.apache.org
>>>
>>>
>>
> 



