incubator-cvs mailing list archives

From Apache Wiki <>
Subject [Incubator Wiki] Update of "PigProposal" by OlgaN
Date Tue, 11 Sep 2007 19:01:00 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The following page has been changed by OlgaN:

New page:
---+!! Pig Open Source Proposal


---++ Abstract

Pig is a platform for analyzing large data sets. 

---++ Proposal

Pig consists of a language and an interactive shell. Pig's language, Pig Latin, is a simple
query algebra that lets you express data transformations such as merging data sets, filtering
them, and applying functions to records or groups of records. 
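
As a rough illustration of the kind of data flow described above, the following Python sketch filters a data set, groups it, and applies a function to each group. It is not Pig Latin and does not reflect how Pig executes anything; the record layout (user, url, seconds) and the sample values are invented for this example.

```python
from collections import defaultdict

# Hypothetical records: (user, url, seconds spent on the page).
visits = [
    ("alice", "a.example", 12),
    ("bob", "b.example", 3),
    ("alice", "c.example", 40),
    ("carol", "a.example", 25),
]

# Filter: keep only visits longer than 10 seconds.
long_visits = [v for v in visits if v[2] > 10]

# Group: collect the filtered records by user.
by_user = defaultdict(list)
for user, url, seconds in long_visits:
    by_user[user].append((url, seconds))

# Apply a function to each group: total seconds per user.
totals = {user: sum(s for _, s in recs) for user, recs in by_user.items()}
print(totals)
```

Each step consumes the output of the previous one, which is the data flow style the proposal refers to.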

Pig Latin has several key properties:

   1 *Ease of programming*. It is trivial to achieve parallel execution of simple, "embarrassingly
parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations
are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
   2 *Optimization opportunities*. The way in which tasks are encoded permits the system to
optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
   3 *Extensibility*. Users can create their own functions to do special-purpose processing.

---++ Background

Pig started as a research project at Yahoo! in May of 2006 to combine ideas in parallel databases
and distributed computing. The first internal release took place in July 2006. The first release
was a simple front-end to the Hadoop Map/Reduce framework. The following releases added new
features and evolved the language based on user feedback. In July 2007, Pig was taken over
by a development team, and the first production version is due to be released on 9/28/07.

Since its inception, we have observed steady growth of the user community within Yahoo!.
In April 2007, Pig was released under a BSD license. Several external parties are using
this version and have expressed interest in collaborating on its development.

---++ Rationale

In an information-centric world, innovation is driven by ad-hoc analysis of large data sets.
For example, search engine companies routinely deploy and refine services based on analyzing
the recorded behavior of users, publishers, and advertisers. The rate of innovation depends
on the efficiency with which data can be analyzed.

To analyze large data sets efficiently, one needs parallelism. The cheapest and most scalable
form of parallelism is cluster computing. Unfortunately, programming for a cluster computing
environment is difficult and time-consuming. Pig makes it easy to harness the power of cluster
computing for ad-hoc data analysis. 

While other languages exist that try to achieve the same goals, we believe that Pig provides
more flexibility and gives more control to the end user. 

SQL typically requires (1) importing data from a user's preferred format into a database system's
internal format, (2) well-structured, normalized data with a declared schema, and (3) programs
expressed in declarative SELECT-FROM-WHERE blocks. In contrast, Pig Latin facilitates (1)
interoperability, i.e. data may be read/written in a format accepted by other applications
such as text editors or graph generators, (2) flexibility, i.e. data may be loosely structured
or have structure that is defined operationally, and (3) adoption by programmers who find
procedural programming more natural than declarative programming.

Sawzall [5] is a scripting language used at Google on top of Map-Reduce. A Sawzall program
has a fairly rigid structure consisting of a filtering phase (the map step) followed by an
aggregation phase (the reduce step). Furthermore, only the filtering phase can be written
by the user, and only a pre-built set of aggregations is available (new ones are non-trivial
to add). While Pig Latin has similar higher level primitives like filtering and aggregation,
an arbitrary number of them can be flexibly chained together in a Pig Latin program, and all
primitives can use user-defined functions with equal ease. Further, Pig Latin has additional
primitives such as cogrouping, that allow operations such as joins (which require multiple
programs in Sawzall) to be written in a single line in Pig Latin. Further, Pig Latin is designed
to be embedded into other languages, and can use functions written in other languages. Thus,
in contrast to Sawzall, it directly caters to a large community of developers without having
to make them learn an entirely new programming language.
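
To make the relationship between cogrouping and joins concrete, here is a minimal Python sketch of the semantics only. It is not Pig Latin syntax and not Pig's implementation; the function names `cogroup` and `join`, the data sets, and the key functions are all hypothetical and chosen for illustration.

```python
from collections import defaultdict

def cogroup(left, right, key_left, key_right):
    """Group two data sets by key; each key maps to (left records, right records)."""
    groups = defaultdict(lambda: ([], []))
    for rec in left:
        groups[key_left(rec)][0].append(rec)
    for rec in right:
        groups[key_right(rec)][1].append(rec)
    return dict(groups)

def join(left, right, key_left, key_right):
    """Inner join: cross product of the two record lists within each cogroup."""
    return [
        l + r
        for ls, rs in cogroup(left, right, key_left, key_right).values()
        for l in ls
        for r in rs
    ]

# Hypothetical data sets keyed by user name.
queries = [("alice", "pig"), ("bob", "hadoop")]
clicks = [("alice", "a.example"), ("alice", "b.example"), ("carol", "c.example")]

grouped = cogroup(queries, clicks, lambda q: q[0], lambda c: c[0])
joined = join(queries, clicks, lambda q: q[0], lambda c: c[0])
print(grouped)
print(joined)
```

In this sketch a join is just a cogroup followed by flattening each group, which is why a primitive of this kind lets a join be expressed as a single step.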

---++ Current Status

---+++ Meritocracy 

Pig was started as a project developed by the Yahoo! research team. Recently we have
added a development team that works in harmony with the research team, with both teams actively
and successfully contributing to the project. We plan to create an environment that
encourages meritocracy and is consistent with the meritocracy principles of Apache. Within
the team we have people actively participating in the Hadoop project.

---+++ Community

Pig has an active user community within Yahoo! that has been steadily growing. Pig has also attracted
external users since its release under a BSD license. Several external parties are using
the product and have expressed interest in collaborating on its development.

Also, since the current version of Pig is built on top of Hadoop, we believe that we will
be able to quickly extend our community by attracting both the Hadoop users and developers
to the project.

---+++ Core Developers

Our contributors come from both the research and development worlds, and most have a background in
database internals and large-scale distributed systems.

---+++ Alignment

Yahoo! seeks to develop Pig collaboratively with others, not to control and maintain it independently.
 Apache offers the best legal and social framework for such community-based software development.

Also, the current version of Pig runs on top of Hadoop's Map-Reduce infrastructure, which
is part of Apache. We believe there would be a lot of synergy between the projects both in
terms of users and developers.

---++ Known Risks
---+++ Orphaned products

All current contributors are part of Yahoo!, which is a major player in the space and is committed
to grid computing. Also, we expect a high degree of synergy with the Hadoop project.

---+++ Inexperience with Open Source

Two of the committers have extensive experience with open source and Apache. The rest are
new to open source and will be guided through the process by the team members with experience.

---+++ Homogenous Developers

The current list of committers is confined to Yahoo! employees. Our plan is to recruit more
committers once the project is under way.

---+++ Reliance on Salaried Developers

Currently, all contributors are Yahoo employees. By extending the development community we
are hoping to mitigate this risk.

---+++ Relationships with Other Apache Products

Pig is built on top of Hadoop, and we expect deep collaboration with the Hadoop project.

---+++ An Excessive Fascination with the Apache Brand

Yahoo! already has a strong brand and is not interested in Apache as a way to gain visibility.
Yahoo! seeks to develop Pig collaboratively with others, not to control and maintain it independently.
 Apache offers the best legal and social framework for such community-based software development.

---++ Documentation

---++ Initial Source

The initial source will be donated by Yahoo! Inc. The donating company will contribute the
initial code base once the proposal is accepted and the necessary infrastructure has been set up.
---++ External Dependencies

   * bzip2: license
   * javacc: license
   * hadoop: license

---++ Required Resources
---+++ Mailing lists

We would need the following mailing lists:
   * pig-private (with moderated subscriptions)
   * pig-dev
   * pig-commits
   * pig-user

---+++ Subversion Directory

---+++ Issue Tracking


---++ Initial Committers

   * Nigel Daley
   * Alan Gates
   * Olga Natkovich
   * Chris Olston
   * Owen O'Malley
   * Ben Reed
   * Utkarsh Srivastava

---++ Affiliation

All initial committers are affiliated with Yahoo!

---++ Sponsors

---+++ Champion

Doug Cutting 

---+++ Nominated Mentors

Doug Cutting 

---+++ Sponsoring Entity 

