From: Brian McCallister
Subject: Re: [VOTE] accept Pig into Incubator
Date: Tue, 25 Sep 2007 11:31:09 -0700
To: general@incubator.apache.org
Message-Id: <626E2442-E514-4C2D-A93D-1D4F0078A848@skife.org>
In-Reply-To: <46F94341.1000405@apache.org>

+1

-Brian

On Sep 25, 2007, at 10:20 AM, Doug Cutting wrote:

> I would like to call the Incubator PMC to vote to incubate the
> proposed Pig project. Discussion on this list evidenced broad
> interest in this project, which bodes well for its ability to build
> a diverse developer community.
>
> http://wiki.apache.org/incubator/PigProposal
>
> +1
>
> Doug
>
> -----------------------------------------------------------
>
> = Proposal for Pig Project =
>
> == Abstract ==
>
> Pig is a platform for analyzing large data sets.
>
> == Proposal ==
>
> The Pig project consists of high-level languages for expressing
> data analysis programs, coupled with infrastructure for evaluating
> these programs. The salient property of Pig programs is that their
> structure is amenable to substantial parallelization, which in
> turn enables them to handle very large data sets.
>
> At the present time, Pig's infrastructure layer consists of a
> compiler that produces sequences of Map-Reduce programs, for which
> large-scale parallel implementations already exist (e.g., the
> Hadoop subproject). Pig's language layer currently consists of a
> textual language called Pig Latin, which has the following key
> properties:
>
>  1. ''Ease of programming''. It is trivial to achieve parallel
>     execution of simple, "embarrassingly parallel" data analysis
>     tasks. Complex tasks composed of multiple interrelated data
>     transformations are explicitly encoded as data flow sequences,
>     making them easy to write, understand, and maintain.
>  2. ''Optimization opportunities''. The way in which tasks are
>     encoded permits the system to optimize their execution
>     automatically, allowing the user to focus on semantics rather
>     than efficiency.
>  3. ''Extensibility''. Users can create their own functions to do
>     special-purpose processing.
>
> == Background ==
>
> Pig started as a research project at Yahoo! in May of 2006 to
> combine ideas in parallel databases and distributed computing. The
> first internal release took place in July 2006. The first release
> was a simple front-end to the Hadoop Map/Reduce framework. The
> following releases added new features and evolved the language
> based on user feedback. In July 2007, Pig was taken over by a
> development team and the first production version is due to be
> released on 9/28/07.
>
> Since its inception, we have observed a steady growth of the user
> community within Yahoo!. In April 2007, Pig was released under a
> BSD-type license. Several external parties are using this version
> and have expressed interest in collaborating on its development.
>
> == Rationale ==
>
> In an information-centric world, innovation is driven by ad-hoc
> analysis of large data sets. For example, search engine companies
> routinely deploy and refine services based on analyzing the
> recorded behavior of users, publishers, and advertisers. The rate
> of innovation depends on the efficiency with which data can be
> analyzed.
>
> To analyze large data sets efficiently, one needs parallelism. The
> cheapest and most scalable form of parallelism is cluster
> computing. Unfortunately, programming for a cluster computing
> environment is difficult and time-consuming. Pig makes it easy to
> harness the power of cluster computing for ad-hoc data analysis.
>
> While other languages exist that try to achieve the same goals, we
> believe that Pig provides more flexibility and gives more control
> to the end user.
>
> SQL typically requires (1) importing data from a user's preferred
> format into a database system's internal format, (2)
> well-structured, normalized data with a declared schema, and (3)
> programs expressed in declarative SELECT-FROM-WHERE blocks. In
> contrast, Pig Latin facilitates (1) interoperability, i.e. data may
> be read/written in a format accepted by other applications such as
> text editors or graph generators, (2) flexibility, i.e. data may be
> loosely structured or have structure that is defined operationally,
> and (3) adoption by programmers who find procedural programming
> more natural than declarative programming.
>
> Sawzall is a scripting language used at Google on top of
> Map-Reduce. A Sawzall program has a fairly rigid structure
> consisting of a filtering phase (the map step) followed by an
> aggregation phase (the reduce step). Furthermore, only the
> filtering phase can be written by the user, and only a pre-built
> set of aggregations is available (new ones are non-trivial to add).
> While Pig Latin has similar higher-level primitives like filtering
> and aggregation, an arbitrary number of them can be flexibly
> chained together in a Pig Latin program, and all primitives can use
> user-defined functions with equal ease. Pig Latin also has
> additional primitives, such as cogrouping, that allow operations
> such as joins (which require multiple programs in Sawzall) to be
> written in a single line. Finally, Pig Latin is designed to be
> embedded into other languages, and can use functions written in
> other languages. Thus, in contrast to Sawzall, it directly caters
> to a large community of developers without having to make them
> learn an entirely new programming language.
>
> == Current Status ==
>
> === Meritocracy ===
>
> Pig was started as a project developed by the Yahoo! research
> team.
> Recently we have added a development team that works in
> harmony with the research team, with both teams actively and
> successfully contributing to the project. We are planning to create
> an environment that encourages meritocracy and is consistent with
> the meritocracy principles of Apache. Within the team we have
> people actively participating in the Hadoop subproject.
>
> === Community ===
>
> Pig has an active user community within Yahoo! that has been
> steadily growing. Pig has also attracted external users since its
> release under a BSD-type license. Several external parties are
> using the product and have expressed interest in collaborating on
> its development.
>
> Also, since the current version of Pig is built on top of Hadoop,
> we believe that we will be able to quickly extend our community by
> attracting both Hadoop users and developers to the project.
>
> === Core Developers ===
>
> Our contributors come from both the research and development
> worlds, and most have a background in database internals and
> large-scale distributed systems.
>
> === Alignment ===
>
> Yahoo! seeks to develop Pig collaboratively with others, not to
> control and maintain it independently. Apache offers the best
> legal and social framework for such community-based software
> development.
>
> Also, the current version of Pig runs on top of Hadoop's
> Map-Reduce infrastructure, which is part of Apache. We believe
> there would be a lot of synergy between the projects, both in terms
> of users and developers.
>
> == Known Risks ==
> === Orphaned products ===
>
> All current contributors are part of Yahoo, which is a major player
> in the space and is committed to grid computing. We also expect a
> high degree of synergy with the Hadoop subproject.
>
> === Inexperience with Open Source ===
>
> Two of the committers have extensive experience with open source
> and Apache.
> The rest are new to open source and will be guided
> through the process by the team members with experience.
>
> === Homogenous Developers ===
>
> The current list of committers is confined to Yahoo employees. Our
> plan is to recruit more committers once the project gets under way.
>
> === Reliance on Salaried Developers ===
>
> Currently, all contributors are Yahoo employees. By extending the
> development community we are hoping to mitigate this risk.
>
> === Relationships with Other Apache Products ===
>
> Pig is built on top of Hadoop and we expect deep collaboration with
> the Hadoop subproject.
>
> === An Excessive Fascination with the Apache Brand ===
>
> Yahoo already has a strong brand and is not interested in Apache
> as a way to gain visibility. Yahoo! seeks to develop Pig
> collaboratively with others, not to control and maintain it
> independently. Apache offers the best legal and social framework
> for such community-based software development.
>
> == Documentation ==
>
> http://research.yahoo.com/project/pig
>
> == Initial Source ==
>
> The initial source will be donated by Yahoo Inc. The donating
> company will contribute the initial code base once the proposal is
> accepted and necessary infrastructure has been set up.
>
> == External Dependencies ==
>
>  1. bzip2: http://www.kohsuke.org/bzip2/: Apache license
>  2. javacc: https://javacc.dev.java.net/: BSD license
>  3. hadoop: http://lucene.apache.org/hadoop/: Apache license
>  4. log4j: http://logging.apache.org/log4j/: Apache license
>  5. jsch: http://www.jcraft.com/jsch: BSD style license:
>     http://www.jcraft.com/jsch/LICENSE.txt
>
> == Required Resources ==
> === Mailing lists ===
>
> We would need the following mailing lists:
>  1. pig-private (with moderated subscriptions)
>  2. pig-dev
>  3. pig-commits
>  4. pig-user
>
> === Subversion Directory ===
>
> https://svn.apache.org/repos/asf/incubator/pig
>
> === Issue Tracking ===
>
> JIRA PIG (PIG)
>
> == Initial Committers ==
>
>  1. Nigel Daley (ndaley@yahoo-inc.com)
>  2. Alan Gates (gates@yahoo-inc.com)
>  3. Olga Natkovich (olgan@yahoo-inc.com)
>  4. Chris Olston (olston@yahoo-inc.com)
>  5. Owen O'Malley (oom@yahoo-inc.com)
>  6. Ben Reed (breed@yahoo-inc.com)
>  7. Utkarsh Srivastava (utkarsh@yahoo-inc.com)
>
> == Affiliation ==
>
> All initial committers are affiliated with Yahoo!
>
> == Sponsors ==
>
> === Champion ===
>
> Doug Cutting
>
> === Nominated Mentors ===
>
>  1. Doug Cutting
>  2. Torsten Curdt
>  3. Bertrand Delacretaz
>  4. Yoav Shapira
>  5. Sylvain Wallez
>
> === Sponsoring Entity ===
>
> Incubator
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org