hadoop-pig-dev mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Notes from Pig contributor workshop
Date Tue, 13 Jul 2010 18:23:53 GMT
On June 30th Yahoo hosted a Pig contributor workshop.  Pig  
contributors from Yahoo, Twitter, LinkedIn, and Cloudera were  
present.  The slides used for the presentations that day have been  
uploaded to http://wiki.apache.org/pig/PigTalksPapers.  Here's a digest
of what was discussed there.  For those who were there, if I forgot  
anything please feel free to add it in.

Thejas Nair discussed his work on performance.  In particular he has  
been looking into how to more efficiently de/serialize complex data  
types and when Pig can make use of lazy deserialization.  Dmitriy  
Ryaboy brought up the question of whether Pig would be open to using  
Avro for de/serialization between Map and Reduce and between MR jobs.   
We concluded that we are open to using whatever is fast.

Richard Ding discussed the work he has been doing to make Pig run
statistics available to users via the logs, to applications running
Pig (such as workflow systems) via a new PigRunner API, and to
developers via Hadoop job history files.  Russell Jurney brought up
that it would be nice if this API also included record input and
output at a per-MR-job level, so that users diagnosing issues with
their Pig Latin scripts would have a better idea in which MR job
things went wrong.

Ashutosh Chauhan gave an overview of the work that has been going on  
to add UDFs in scripting languages to Pig (PIG-928).

Daniel Dai talked about the rewrite of the logical optimizer that he  
has been doing, including an overview of the major rules being  
implemented in the new optimizer framework.  Dmitriy indicated that he  
would really like to see pushing of limits into the RecordReader (so  
that we can terminate reading early) added to the list of rules.  This  
would involve making use of the new optimizer framework in the MR  
optimizer.  Alan Gates indicated that while he does not believe we  
should translate the entire set of MR optimizer visitors into the new  
framework until we've further tested the framework, this might be a  
good first test for the new optimizer in the MR optimizer.
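
To illustrate why pushing a limit into the reader matters, here is a
minimal sketch.  It is not Pig or Hadoop code: CountingReader is a
hypothetical stand-in for a RecordReader, and the names and the
"limit" parameter are assumptions made for illustration only.  It
compares applying a limit after everything has been read with
stopping the read as soon as enough records have been produced.

```python
class CountingReader:
    """Hypothetical stand-in for a RecordReader (not a Hadoop class).

    It counts how many records it pulls from the underlying source.
    """

    def __init__(self, records):
        self._records = records
        self.records_read = 0

    def read(self, limit=None):
        # A non-positive limit means there is nothing to read at all.
        if limit is not None and limit <= 0:
            return
        for record in self._records:
            self.records_read += 1
            yield record
            # With the limit pushed down, stop pulling from the source
            # as soon as enough records have been handed out.
            if limit is not None and self.records_read >= limit:
                return


data = range(1000)

# Limit applied downstream: the reader still scans the whole input.
late = CountingReader(data)
late_result = list(late.read())[:3]

# Limit pushed into the reader: reading terminates early.
early = CountingReader(data)
early_result = list(early.read(limit=3))
```

Both approaches return the same three records, but in this sketch
the first reader touches all 1000 records while the second touches
only 3.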

Aniket Mokashi showed the work he's been doing to add a custom  
partitioner to Pig.  He also covered his work to add the ability to
reuse a relation that contains a single record with a single field as a
scalar.  Dmitriy pointed out that we need to make sure this uses the  
distributed cache to minimize strain on the namenode.

Pradeep Kamath gave a short presentation on Howl, the work he is  
leading to create a shared metadata system between Pig, Hive, and
MapReduce.  Dmitriy noted that we need to get this work more in the open
so others can participate and contribute.

Russell Jurney talked about his work on adding datetime types to Pig.   
He indicated he was interested in using Joda-Time as the basis for
this.  There were some questions on how these types would be  
serialized in text files where the type information might be lost.

Olga Natkovich talked about areas the Yahoo Pig team would like to
work on in the future, mostly focused on usability.
These included changing our parser to one that will allow us to give
better error messages.  Dmitriy indicated he strongly preferred
ANTLR.  They also included resurrecting support for the illustrate
command, which we have let lapse.  Richard and Ashutosh noted that how
illustrate works internally needs some redesign, because currently it  
requires special code inside each physical operator.  This makes it  
hard to maintain illustrate in the face of new operators, and pollutes  
the main code path during execution.  Instead it should be done via  
callbacks or some other solution.

After these presentations the group took on a couple of topics for  
discussion.  The first was how Pig should grow to become Turing  
complete.  For this Dmitriy and Ning Liang presented Piglet, a Ruby  
library they use at Twitter to wrap Pig and provide branching,  
looping, functions, and modules.  Several people in the group  
expressed concerns that growing Pig Latin itself to be Turing complete  
will result in a poorly thought out language with insufficient tools  
and too much maintenance in the future.  One suggestion that was made  
was to create a Java interface that allowed users to directly  
construct Pig data flows.  That is, this interface would (roughly)  
have a method for each Pig operator.  Users could then construct Pig  
data flows directly in Java.  Users who wished to use scripting  
languages could still access this with no additional work via Jython,  
JRuby, Groovy, etc.
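
A rough sketch of what such an interface might look like follows.
The proposal was for a Java interface; this is written in Python
only for brevity.  Everything here is hypothetical: the DataFlow
class, its method names, and the choice to emit Pig Latin text
(rather than build a plan directly) are assumptions, not an existing
Pig API.  The point is that branching and looping come from the host
language rather than from Pig Latin.

```python
class DataFlow:
    """Hypothetical builder with one method per Pig operator.

    For illustration it just accumulates Pig Latin statements.
    """

    def __init__(self):
        self._statements = []
        self._next_alias = 0

    def _assign(self, expression):
        # Give each new relation a fresh alias and record its statement.
        alias = f"r{self._next_alias}"
        self._next_alias += 1
        self._statements.append(f"{alias} = {expression};")
        return alias

    def load(self, path, schema):
        return self._assign(f"LOAD '{path}' AS ({schema})")

    def filter(self, relation, condition):
        return self._assign(f"FILTER {relation} BY {condition}")

    def store(self, relation, path):
        self._statements.append(f"STORE {relation} INTO '{path}';")

    def script(self):
        return "\n".join(self._statements)


flow = DataFlow()
users = flow.load("users.txt", "name:chararray, age:int")

# The loop lives in the host language, not in Pig Latin.
for min_age in (18, 65):
    subset = flow.filter(users, f"age >= {min_age}")
    flow.store(subset, f"users_over_{min_age}")

script = flow.script()
```

Here the loop produces two FILTER/STORE pairs over the same loaded
relation without any new control-flow syntax in Pig Latin itself.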

The second discussion centered on Pig's support for workflow systems  
such as Oozie and Azkaban.  There have been proposals in the past that  
Pig switch to generate Oozie workflows instead of MR jobs.  Alan  
indicated that he does not see the value of this.  There have been  
proposals that Pig Latin be extended to include workflow controls.   
Dmitriy and Russell both indicated they thought extending Pig Latin in
this way was a bad idea and seemed like a layer violation.  Alejandro
Abdelnur (architect for Oozie at Yahoo) indicated he was happy with  
the interface changes being made by Richard as part of 0.8.  Alan  
indicated we need to talk with the Azkaban guys to see what would make  
integration better for them.

We ended with a few last discussion points.  Dmitriy suggested that  
Piggybank should move out of contrib into a more CPAN-like environment
that was version independent.  This frees Pig contributors from  
needing to keep older UDFs up to date.  It allows users to download  
versions of the UDFs that are appropriate to the version of Pig they  
are using.  And it allows UDF contributors to more easily contribute  
their code without going through the whole patch acceptance process.   
The group indicated they were open to this approach, though no one  
volunteered to undertake setting it up.

Ashutosh asked whether there would be a 0.7.1 release since several  
important issues had been found and resolved since 0.7.0.  The Yahoo  
team (which has driven all previous releases) indicated it had no  
immediate plans to do so, but it was open to helping anyone who wanted  
to drive it.  No one volunteered.

At the end we agreed that this had been useful and we should do it on  
a more regular basis.  We also agreed that we need to find a way to  
open this up to others who do not live in the Bay Area.  Alan agreed  
to work on facilitating this.

