incubator-general mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: [VOTE] accept Pig into Incubator
Date Tue, 25 Sep 2007 18:45:21 GMT
+1

	Erik


On Sep 25, 2007, at 1:20 PM, Doug Cutting wrote:

> I would like to call the Incubator PMC to vote to incubate the  
> proposed Pig project.  Discussion on this list evidenced broad  
> interest in this project, which bodes well for its ability to build  
> a diverse developer community.
>
> http://wiki.apache.org/incubator/PigProposal
>
> +1
>
> Doug
>
> -----------------------------------------------------------
>
> = Proposal for Pig Project =
>
> == Abstract ==
>
> Pig is a platform for analyzing large data sets.
>
> == Proposal ==
>
> The Pig project consists of high-level languages for expressing  
> data analysis programs, coupled with infrastructure for evaluating  
> these programs. The salient property of Pig programs is that their  
> structure is amenable to substantial parallelization, which in
> turn enables them to handle very large data sets.
>
> At the present time, Pig's infrastructure layer consists of a  
> compiler that produces sequences of Map-Reduce programs, for which  
> large-scale parallel implementations already exist (e.g., the  
> Hadoop subproject). Pig's language layer currently consists of a  
> textual language called Pig Latin, which has the following key  
> properties:
>
>  1. ''Ease of programming''. It is trivial to achieve parallel  
> execution of simple, "embarrassingly parallel" data analysis tasks.  
> Complex tasks composed of multiple interrelated data
> transformations are explicitly encoded as data flow sequences,  
> making them easy to write, understand, and maintain.
>  2. ''Optimization opportunities''. The way in which tasks are  
> encoded permits the system to optimize their execution  
> automatically, allowing the user to focus on semantics rather than  
> efficiency.
>  3. ''Extensibility''. Users can create their own functions to do  
> special-purpose processing.
>
> == Background ==
>
> Pig started as a research project at Yahoo! in May of 2006 to  
> combine ideas in parallel databases and distributed computing. The  
> first internal release took place in July 2006. The first release  
> was a simple front-end to the Hadoop Map/Reduce framework. The  
> following releases added new features and evolved the language  
> based on user feedback. In July 2007, Pig was taken over by a
> development team and the first production version is due to be  
> released on 9/28/07.
>
> Since its inception, we have observed steady growth of the user
> community within Yahoo!. In April 2007, Pig was released under a
> BSD-type license.  Several external parties are using this version  
> and have expressed interest in collaborating on its development.
>
> == Rationale ==
>
> In an information-centric world, innovation is driven by ad-hoc  
> analysis of large data sets. For example, search engine companies  
> routinely deploy and refine services based on analyzing the  
> recorded behavior of users, publishers, and advertisers. The rate  
> of innovation depends on the efficiency with which data can be
> analyzed.
>
> To analyze large data sets efficiently, one needs parallelism. The  
> cheapest and most scalable form of parallelism is cluster  
> computing. Unfortunately, programming for a cluster computing  
> environment is difficult and time-consuming. Pig makes it easy to  
> harness the power of cluster computing for ad-hoc data analysis.
>
> While other languages exist that try to achieve the same goals, we
> believe that Pig provides more flexibility and gives more control  
> to the end user.
>
> SQL typically requires (1) importing data from a user's preferred
> format into a database system's internal format, (2) well-structured,
> normalized data with a declared schema, and (3) programs expressed in
> declarative SELECT-FROM-WHERE blocks. In contrast, Pig Latin
> facilitates (1) interoperability, i.e., data may be read or written in
> a format accepted by other applications such as text editors or graph
> generators, (2) flexibility, i.e., data may be loosely structured or
> have structure that is defined operationally, and (3) adoption by
> programmers who find procedural programming more natural than
> declarative programming.
>
> Sawzall is a scripting language used at Google on top of Map-Reduce.
> A Sawzall program has a fairly rigid structure consisting of a
> filtering phase (the map step) followed by an aggregation phase (the
> reduce step). Furthermore, only the filtering phase can be written by
> the user, and only a pre-built set of aggregations is available (new
> ones are non-trivial to add). While Pig Latin has similar higher-level
> primitives like filtering and aggregation, an arbitrary number of them
> can be flexibly chained together in a Pig Latin program, and all
> primitives can use user-defined functions with equal ease. Pig Latin
> also has additional primitives, such as cogrouping, that allow
> operations such as joins (which require multiple programs in Sawzall)
> to be written in a single line. Finally, Pig Latin is designed to be
> embedded into other languages, and can use functions written in other
> languages. Thus, in contrast to Sawzall, it directly caters to a large
> community of developers without having to make them learn an entirely
> new programming language.
>
> == Current Status ==
>
> === Meritocracy ===
>
> Pig started as a project developed by a Yahoo! research team.
> Recently we added a development team that works in harmony with the
> research team, with both teams actively and successfully
> contributing to the project. We plan to create an environment that
> encourages meritocracy and is consistent with the meritocracy
> principles of Apache. Within the team we have people actively
> participating in the Hadoop subproject.
>
> === Community ===
>
> Pig has an active user community within Yahoo! that has been  
> steadily growing. Pig has also attracted external users since its
> release under a BSD-type license.  Several external parties are  
> using the product and have expressed interest in collaborating on  
> its development.
>
> Also, since the current version of Pig is built on top of Hadoop, we
> believe that we will be able to quickly extend our community by
> attracting both Hadoop users and developers to the project.
>
> === Core Developers ===
>
> Our contributors come from both the research and development worlds,
> and most have backgrounds in database internals and large-scale
> distributed systems.
>
> === Alignment ===
>
> Yahoo! seeks to develop Pig collaboratively with others, not to  
> control and maintain it independently.  Apache offers the best  
> legal and social framework for such community-based software  
> development.
>
> Also, the current version of Pig runs on top of Hadoop's Map-Reduce
> infrastructure, which is part of Apache. We believe there would be a
> lot of synergy between the projects, both in terms of users and
> developers.
>
> == Known Risks ==
> === Orphaned products ===
>
> All current contributors are part of Yahoo!, which is a major player
> in the space and is committed to grid computing. We also expect a
> high degree of synergy with the Hadoop subproject.
>
> === Inexperience with Open Source ===
>
> Two of the committers have extensive experience with open source  
> and Apache. The rest are new to open source and will be guided  
> through the process by the team members with experience.
>
> === Homogeneous Developers ===
>
> The current list of committers is confined to Yahoo! employees. Our
> plan is to recruit more committers once the project is under way.
>
> === Reliance on Salaried Developers ===
>
> Currently, all contributors are Yahoo employees. By extending the  
> development community we are hoping to mitigate this risk.
>
> === Relationships with Other Apache Products ===
>
> Pig is built on top of Hadoop and we expect deep collaboration with
> the Hadoop subproject.
>
> === An Excessive Fascination with the Apache Brand ===
>
> Yahoo! already has a strong brand and is not interested in Apache
> as a way to gain visibility. Yahoo! seeks to develop Pig  
> collaboratively with others, not to control and maintain it  
> independently.  Apache offers the best legal and social framework  
> for such community-based software development.
>
> == Documentation ==
>
> http://research.yahoo.com/project/pig
>
> == Initial Source ==
>
> The initial source will be donated by Yahoo Inc. The donating  
> company will contribute the initial code base once the proposal is  
> accepted and necessary infrastructure has been set up.
>
> == External Dependencies ==
>
>  1. bzip2: http://www.kohsuke.org/bzip2/ (Apache license)
>  2. javacc: https://javacc.dev.java.net/ (BSD license)
>  3. hadoop: http://lucene.apache.org/hadoop/ (Apache license)
>  4. log4j: http://logging.apache.org/log4j/ (Apache license)
>  5. jsch: http://www.jcraft.com/jsch (BSD-style license:
> http://www.jcraft.com/jsch/LICENSE.txt)
>
> == Required Resources ==
> === Mailing lists ===
>
> We would need the following mailing lists:
>  1. pig-private (with moderated subscriptions)
>  2. pig-dev
>  3. pig-commits
>  4. pig-user
>
> === Subversion Directory ===
>
> https://svn.apache.org/repos/asf/incubator/pig
>
> === Issue Tracking ===
>
> JIRA PIG (PIG)
>
> == Initial Committers ==
>
>  1. Nigel Daley (ndaley@yahoo-inc.com)
>  2. Alan Gates (gates@yahoo-inc.com)
>  3. Olga Natkovich (olgan@yahoo-inc.com)
>  4. Chris Olston (olston@yahoo-inc.com)
>  5. Owen O'Malley (oom@yahoo-inc.com)
>  6. Ben Reed (breed@yahoo-inc.com)
>  7. Utkarsh Srivastava (utkarsh@yahoo-inc.com)
>
> == Affiliation ==
>
> All initial committers are affiliated with Yahoo!
>
> == Sponsors ==
>
> === Champion ===
>
> Doug Cutting
>
> === Nominated Mentors ===
>
>    1. Doug Cutting
>    2. Torsten Curdt
>    3. Bertrand Delacretaz
>    4. Yoav Shapira
>    5. Sylvain Wallez
>
> === Sponsoring Entity ===
>
> Incubator
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org



