pig-dev mailing list archives

From "Santhosh Srinivasan" <...@yahoo-inc.com>
Subject RE: Begin a discussion about Pig as a top level project
Date Sat, 03 Apr 2010 19:43:05 GMT
I see this as a multi-part question. Looking back at some of the
significant roadmap/existential questions asked in the last 12 months, I
see the following:

1. With the introduction of SQL, what is the philosophy of Pig? (I sent
an email about this approximately 9 months ago.)
2. What is the approach to support backward compatibility in Pig? (Alan
sent an email about this 3 months ago.)
3. Should Pig be a TLP? (The current email thread.)

Here is my take on answering the aforementioned questions.

The initial philosophy of Pig was to be backend agnostic. It was
designed as a data flow language. Whenever a new language is designed,
the syntax and semantics of the language have to be laid out. The syntax
is usually captured in the form of a BNF grammar. The semantics are
defined by the language creators. Backward compatibility is then a
question of holding true to the syntax and semantics. With Pig, in
addition to the language, the Java APIs were exposed to customers to
implement UDFs (load/store/filter/grouping/row transformation, etc.),
to provide looping, since the language does not support looping
constructs, and to support a programmatic mode of access. Backward
compatibility in this context means supporting API versioning.

Do we still intend to position Pig as a data flow language that is backend
agnostic? If the answer is yes, then there is a strong case for making
Pig a TLP.

Are we influenced by Hadoop? A big YES! The reason Pig chose to become a
Hadoop sub-project was to ride the Hadoop popularity wave. As a
consequence, we chose to be heavily influenced by the Hadoop roadmap.

Like a good lawyer, I also have rebuttals to Alan's questions :)

1. Search engine popularity - We can discuss this with the Hadoop team
and still retain links to TLPs that are coupled (loosely or tightly).
2. Explicit connection to Hadoop - I see this as a logical connection
vs. a physical connection. Today, we are physically connected as a
sub-project. Becoming a TLP will not increase/decrease our influence on
the Hadoop community (think Logical, Physical and MR Layers :)
3. Philosophy - I have already talked about this. The tight coupling is
by choice. If Pig continues to be a data flow language with clear syntax
and semantics then someone can implement Pig on top of a different
backend. Do we intend to take this approach?

I just wanted to offer a different opinion to this thread. I strongly
believe that we should think about the original philosophy. Will we have
a Pig standards committee that will decide on the changes to the
language (think C/C++) if there are multiple backend implementations?

I will reserve my vote pending the outcome of the philosophy and
backward compatibility discussions. If we decide that Pig will be
treated and maintained like a true language with clear syntax and
semantics then we have a strong case to make it into a TLP. If not, we
should retain our existing ties to Hadoop and make Pig into a data flow
language for Hadoop.


-----Original Message-----
From: Thejas Nair [mailto:tejas@yahoo-inc.com] 
Sent: Friday, April 02, 2010 4:08 PM
To: pig-dev@hadoop.apache.org; Dmitriy Ryaboy
Subject: Re: Begin a discussion about Pig as a top level project

I agree with Alan and Dmitriy - Pig is tightly coupled with hadoop, and
heavily influenced by its roadmap. I think it makes sense to continue as
a sub-project of hadoop.


On 3/31/10 4:04 PM, "Dmitriy Ryaboy" <dvryaboy@gmail.com> wrote:

> Over time, Pig is increasing its coupling to Hadoop (for good 
> reasons), rather than decreasing it. If and when Pig becomes a viable 
> entity without hadoop around, it might make sense as a TLP. As is, I 
> think becoming a TLP will only introduce unnecessary administrative
> and bureaucratic headaches.
> So my vote is also -1.
> -Dmitriy
> On Wed, Mar 31, 2010 at 2:38 PM, Alan Gates <gates@yahoo-inc.com> wrote:
>> So far I haven't seen any feedback on this.  Apache has asked the 
>> Hadoop PMC to submit input in April on whether some subprojects 
>> should be promoted to TLPs.  We, the Pig community, need to give 
>> feedback to the Hadoop PMC on how we feel about this.  Please make
>> your voice heard.
>> So now I'll heed my own call and give my thoughts on it.
>> The biggest advantage I see to being a TLP is a direct connection to 
>> Apache.  Right now all of the Pig team's interaction with Apache is 
>> through the Hadoop PMC.  Being directly connected to Apache would 
>> benefit Pig team members who would have a better view into Apache.  
>> It would also raise our profile in Apache and thus make other
>> projects more aware of us.
>> However, I am concerned about losing Pig's explicit connection to
>> Hadoop.  This concern has a couple of dimensions.  One, Hadoop and MapReduce 
>> are the current flavor of the month in computing.  Given that Pig 
>> shares a name with the common farm animal, it's hard to be sure based
>> on search statistics.
>> But Google trends shows that "hadoop" is searched on much more 
>> frequently than "hadoop pig" or "apache pig" (see 
>> http://www.google.com/trends?q=hadoop%2Chadoop+pig).  I am guessing 
>> that most Pig users come from Hadoop users who discover Pig via
>> Hadoop's website.
>> Losing that subproject tab on Hadoop's front page may radically 
>> lower the number of users coming to Pig to check out our project.  I 
>> would argue that this benefits Hadoop as well, since high level 
>> languages like Pig Latin have the potential to greatly extend the
>> user base and usability of Hadoop.
>> Two, being explicitly connected to Hadoop keeps our two communities 
>> aware of each other's needs.  There are features proposed for MR that 
>> would greatly help Pig.  By staying in the Hadoop community Pig is 
>> better positioned to advocate for and help implement and test those 
>> features.  The response to this will be that Pig developers can still
>> subscribe to Hadoop mailing lists, submit patches, etc.  That is, 
>> they can still be part of the Hadoop community.  Which reinforces my 
>> point that it makes more sense to leave Pig in the Hadoop community 
>> since Pig developers will need to be part of that community anyway.
>> Finally, philosophically it makes sense to me that projects that are 
>> tightly connected belong together.  It strikes me as strange to have 
>> Pig as a TLP completely dependent on another TLP.  Hadoop was 
>> originally a subproject of Lucene.  It moved out to be a TLP when it 
>> became obvious that Hadoop had become independent of and useful apart
>> from Lucene.  Pig is not in that position relative to Hadoop.
>> So, I'm -1 on Pig moving out.  But this is a soft -1.  I'm open to 
>> being persuaded that I'm wrong or my concerns can be addressed while 
>> still having Pig as a TLP.
>> Alan.
>> On Mar 19, 2010, at 10:59 AM, Alan Gates wrote:
>>> You have probably heard by now that there is a discussion going on 
>>> in the Hadoop PMC as to whether a number of the subprojects (Hbase, 
>>> Avro, Zookeeper, Hive, and Pig) should move out from under the Hadoop 
>>> umbrella and become top level Apache projects (TLP).  This 
>>> discussion has picked up recently since the Apache board has clearly
>>> communicated to the Hadoop PMC that it is concerned that Hadoop is 
>>> acting as an umbrella project with many disjoint subprojects 
>>> underneath it.  They are concerned that this gives Apache little 
>>> insight into the health and happenings of the subproject communities
>>> which in turn means Apache cannot properly mentor those communities.
>>> The purpose of this email is to start a discussion within the Pig 
>>> community about this topic.  Let me cover first what becoming TLP 
>>> would mean for Pig, and then I'll go into what options I think we as
>>> a community have.
>>> Becoming a TLP would mean that Pig would itself have a PMC that 
>>> would report directly to the Apache board.  Who would be on the PMC 
>>> would be something we as a community would need to decide.  Common 
>>> options would be to say all active committers are on the PMC, or all
>>> active committers who have been a committer for at least a year.  We
>>> would also need to elect a chair of the PMC.  This lucky person 
>>> would have no additional power, but would have the additional 
>>> responsibility of writing quarterly reports on Pig's status for 
>>> Apache board meetings, as well as coordinating with Apache to get 
>>> accounts for new committers, etc.  For more information see 
>>> http://www.apache.org/foundation/how-it-works.html#roles
>>> Becoming a TLP would not mean that we are ostracized from the Hadoop
>>> community.  We would continue to be invited to Hadoop Summits, HUGs,
>>> etc.  Since all Pig developers and users are by definition Hadoop users, 
>>> we would continue to be a strong presence in the Hadoop community.
>>> I see three ways that we as a community can respond to this:
>>> 1) Say yes, we want to be a TLP now.
>>> 2) Say yes, we want to be a TLP, but not yet.  We feel we need more 
>>> time to mature.  If we choose this option we need to be able to 
>>> clearly articulate how much time we need and what we hope to see 
>>> change in that time.
>>> 3) Say no, we feel the benefits for us staying with Hadoop outweigh 
>>> the drawbacks of being a disjoint subproject.  If we choose this, we
>>> need to be able to say exactly what those benefits are and why we 
>>> feel they will be compromised by leaving the Hadoop project.
>>> There may be other options that I haven't thought of.  Please feel free
>>> to suggest any you think of.
>>> Questions?  Thoughts?  Let the discussion begin.
>>> Alan.
