hadoop-pig-dev mailing list archives

From Amr Awadallah <...@cloudera.com>
Subject Re: Revisit Pig Philosophy?
Date Mon, 21 Sep 2009 22:29:57 GMT
> Pig Latin is intended to be a language for parallel data processing.
> It is not tied to one particular parallel framework


-- amr

Alan Gates wrote:
> I agree with Milind that we should move to saying that Pig Latin is a 
> data flow language independent of any particular platform, while the 
> current implementation of Pig is tied to Hadoop.  I'm not sure how 
> thin that implementation will be, but I'm in favor of making it thin 
> where possible (such as the recent proposal to shift LoadFunc to 
> directly use InputFormat).
> I also strongly agree that we need to be more precise in our 
> terminology between Pig (the platform) and Pig Latin (the language), 
> especially as we're working on making Pig bilingual (with the addition 
> of SQL).
> I am fine with saying that Pig SQL adheres as much as possible (given 
> the underlying systems, etc.) to ANSI SQL semantics.  And where there 
> is shared functionality such as UDFs we again adhere to SQL semantics 
> when it does not conflict with other Pig goals.  So COUNT and SUM 
> should handle nulls the way SQL does, for example.  But we need to 
> craft the statement carefully.  To see why, consider Pig's data 
> model.  We would like our types to map nicely into SQL types, so that 
> if Pig SQL users declare a column to be of type VARCHAR(32) or 
> FLOAT(10) we can map those onto some Pig type.  But we don't want to 
> use SQL types directly inside Pig, as they aren't a good match for 
> much of Pig processing.  So any statement of using SQL semantics needs 
> caveats.
> I would also vote for modifying our Pigs Live Anywhere dictum to be:
> Pig Latin is intended to be a language for parallel data processing. 
> It is not tied to one particular parallel framework. The initial 
> implementation of Pig is on Hadoop and seeks to leverage the power of 
> Hadoop wherever possible.  However, nothing Hadoop-specific should be 
> exposed in Pig Latin.
> We may also want to add a vocabulary section to the philosophy 
> statement to clarify between Pig and Pig Latin.
> Alan.
> On Sep 18, 2009, at 8:01 PM, Milind A Bhandarkar wrote:
>> It's Friday evening, so I have some time to discuss philosophy ;-)
>> Before we discuss any question about revisiting pig philosophy, the
>> first question that needs to be answered is "what is pig" ? (this
>> corresponds to the Hindu philosophy's basic argument, that any deep
>> personal philosophical investigations need to start with a question
>> "koham?" (in Sanskrit, it means 'who am I?'))
>> So, coming back to approx 4000 years after the origin of that
>> philosophy, we need to ask "what is pig?" (incidentally, pig, or
>> varaaha in Sanskrit, was the second incarnation of lord Vishnu in
>> Hindu scriptures, but that's not relevant here.)
>> What we need to decide is: is Pig a dataflow language? I think
>> not. "Pig Latin" is the language. Pig is referred to in countless
>> slide decks (aka Pig scriptures; btw, I own 50% of these scriptures)
>> as a runtime system that interprets Pig Latin, kind of like Java and
>> the JVM. (Duality of nature, called "dwaita" philosophy in Sanskrit, is
>> applicable here. But I won't go deeper than that.)
>> So, Pig-Latin-the-language's stance could still be that it could be
>> implemented on any runtime. But pig the runtime's philosophy could be
>> that it is a thin layer on top of hadoop. And all the world could
>> breathe a sigh of relief. (mostly, by not having to answer these
>> philosophical questions.)
>> So, 'koham' is the 4000 year old question this project needs to
>> answer. That's all.
>> AUM...... (it's Friday.)
>> - (swami) Milind ;-)
>> On Sep 18, 2009, at 19:05, "Jeff Hammerbacher" <hammer@cloudera.com>
>> wrote:
>>> Hey,
>>>> 2. Local mode and other parallel frameworks
>>>> <snip>
>>>> Pigs Live Anywhere
>>>> Pig is intended to be a language for parallel data processing. It is
>>>> not tied to one particular parallel framework. It has been implemented
>>>> first on hadoop, but we do not intend that to be only on hadoop.
>>>> </snip>
>>>> Are we still holding onto this? What about local mode? Local mode is
>>>> not being treated on equal footing with that of Hadoop for practical
>>>> reasons. However, users expect things that work on local mode to work
>>>> without any hitches on Hadoop.
>>>> Are we still designing the system assuming that Pig will be stacked on
>>>> top of other parallel frameworks?
>>> FWIW, I appreciate this philosophical stance from Pig. Allowing locally
>>> tested scripts to be migrated to the cluster without breakage is a noble
>>> goal, and keeping the option of (one day) developing an alternative
>>> execution environment for Pig that runs over HDFS but uses a richer
>>> set of physical operators than MapReduce would be great.
>>> Of course, those of you who are running Pig in production will have a
>>> much better sense of the feasibility, rather than desirability, of this
>>> philosophical stance.
>>> Later,
>>> Jeff
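
Note on Alan's point that COUNT and SUM "should handle nulls the way SQL
does": the sketch below models the usual SQL aggregate rules in plain
Python. It is only an illustration, not Pig code. The names sql_count and
sql_sum are hypothetical, and None stands in for SQL NULL.

```python
from typing import Iterable, Optional


def sql_count(values: Iterable[Optional[float]]) -> int:
    """Model of SQL COUNT(column): nulls (None) are not counted."""
    return sum(1 for v in values if v is not None)


def sql_sum(values: Iterable[Optional[float]]) -> Optional[float]:
    """Model of SQL SUM(column): nulls are skipped.

    If there are no non-null inputs, the result is NULL (None), not 0.
    """
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


if __name__ == "__main__":
    column = [1, None, 3]
    print(sql_count(column))        # nulls are ignored by COUNT(column)
    print(sql_sum(column))          # nulls are ignored by SUM
    print(sql_sum([None, None]))    # all-null input yields NULL
```

The all-null case is where a naive sum would differ most visibly: SQL
yields NULL, while a plain running total would yield 0.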
