pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "GSoc2011" by daijy
Date Mon, 14 Mar 2011 22:00:31 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "GSoc2011" page has been changed by daijy.


New page:
= Google Summer of Code 2011 for Pig =

Pig is exciting! Pig provides an intuitive way to program Hadoop. Inside Yahoo!, more than
80% of Hadoop jobs are Pig jobs. It is heavily used at Twitter, LinkedIn, and many other
organizations (http://wiki.apache.org/pig/PoweredBy).

"When I say Hadoop, I really mean Pig" -- Milind Bhandarkar from LinkedIn

Last year, Pig participated in Google Summer of Code for the first time. One student (Gianmarco)
worked on [[https://issues.apache.org/jira/browse/PIG-1295|raw comparator secondary sort]].
The project was very successful, and we adopted his code into our code base.

This year, we will participate again. We have picked a list of highly desired projects
for students. All of these projects are doable within the scope of the GSoC program. Once you are
accepted, we will assign a dedicated mentor to guide you through the different stages of the program.
We need your help, and you will gain great experience participating in an open source project.

== Project List ==

=== Nested foreach statement (https://issues.apache.org/jira/browse/PIG-1631) ===
Pig supports DISTINCT, FILTER, LIMIT, and ORDER BY inside a nested foreach statement (http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#FOREACH,
http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#nestedblock). However, a FOREACH inside the nested block is not yet
supported and is highly desired. For example, we would like to write a query like:

sessionByUser = group session by user;
b = foreach sessionByUser {
    b1 = foreach session generate accumulateSession(group, session);
    generate group, b1;
};

Although some of this functionality can be achieved by other approaches (Accumulator, UDF with
bag, query rewrite, etc.), nested foreach offers simplicity and additional opportunities for
optimization (optimization is not part of this project).
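
To make the intended semantics concrete, here is a minimal Python sketch of what a nested foreach would compute: group the rows, then apply a per-tuple expression to every tuple in each group's bag. This is only an illustration, not Pig code. The helper names (`group_by`, `nested_foreach`), the sample data, and the per-tuple projection are all hypothetical; in particular, the projection stands in for the `accumulateSession` UDF from the query above, whose behavior is not specified here.

```python
from collections import defaultdict

def group_by(rows, key):
    """Mimic Pig's GROUP: map each key value to the bag (list) of matching rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

def nested_foreach(groups, per_tuple):
    """Inner foreach: transform every tuple in a group's bag.
    Outer generate: emit (group, transformed_bag) for each group."""
    return [(g, [per_tuple(g, t) for t in bag]) for g, bag in groups.items()]

# Hypothetical session data, for illustration only.
sessions = [
    {"user": "alice", "page": "/home", "ms": 120},
    {"user": "bob", "page": "/cart", "ms": 300},
    {"user": "alice", "page": "/buy", "ms": 80},
]

session_by_user = group_by(sessions, "user")
# The lambda is a stand-in for a per-tuple expression inside the nested block.
result = nested_foreach(session_by_user, lambda user, t: (t["page"], t["ms"]))
```

Each element of `result` pairs a user with a bag of transformed tuples, which is the shape `generate group, b1` would produce.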

=== Syntax sugar ===
We'd like to add several pieces of syntactic sugar (a student may pick 2-3 from the list):
 * Default split destination
The [[http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#SPLIT|Split statement]] would be better
with a default destination, e.g.:
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6), OTHER otherwise;
-- OTHER has all tuples with f1>=7 && f2!=5 && f3==6

 * TOMAP, TOTUPLE, TOBAG syntax support:
Pig has TOMAP, TOTUPLE, and TOBAG UDFs. However, they would be much easier to use if we added
syntax support for them:
b = foreach a generate [a0#b0] as m;
b = foreach a generate (a0, a1) as t1;
b = foreach a generate {(a0)} as b1;  -- b1 is a single tuple bag

 * Limit/Sample with a nonconstant (https://issues.apache.org/jira/browse/PIG-1713)
Currently, [[http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#LIMIT|Limit]] and [[http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#SAMPLE|Sample]]
only take a constant. It would be better if we could use a [[http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Casting+Relations+to+Scalars|scalar]]
in place of the constant, e.g.:
a = load 'a.txt';
b = group a all;
c = foreach b generate COUNT(a) as sum;
d = order a by $0;
e = limit d c.sum/100;

 * More sampling algorithms
Currently, the sample statement only supports simple random sampling. It would be better to support
more algorithms (stratified sampling, bootstrap sampling, etc.)
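
For students unfamiliar with the sampling methods named in the last item, the following Python sketch shows one possible reading of stratified and bootstrap sampling. It is not Pig code and says nothing about how Pig would implement them. The function names, parameters, and sample rows are hypothetical, and taking the ceiling of `fraction * stratum size` is just one of several reasonable rounding choices.

```python
import math
import random
from collections import defaultdict

def stratified_sample(rows, stratum_of, fraction, seed=None):
    """Draw the same fraction from every stratum, without replacement."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[stratum_of(row)].append(row)
    sample = []
    for members in strata.values():
        # Assumption: round up, so small strata are still represented.
        k = min(len(members), math.ceil(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

def bootstrap_sample(rows, seed=None):
    """Draw len(rows) rows with replacement (rows must be non-empty)."""
    rng = random.Random(seed)
    return rng.choices(rows, k=len(rows))

# Hypothetical rows of (region, id); the region is the stratum.
rows = [("us", 1), ("us", 2), ("us", 3), ("us", 4), ("eu", 5), ("eu", 6)]
picked = stratified_sample(rows, stratum_of=lambda r: r[0], fraction=0.5, seed=42)
resampled = bootstrap_sample(rows, seed=42)
```

With `fraction=0.5`, `picked` holds two "us" rows and one "eu" row, whereas simple random sampling gives no such per-stratum guarantee.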

=== New DataType: Boolean (https://issues.apache.org/jira/browse/PIG-1429) ===
Pig does not support a boolean [[http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Data+Types|data
type]] yet. The only exception is that a user can define a UDF that returns boolean. However, in
follow-up processing, the user might encounter [[https://issues.apache.org/jira/browse/PIG-1097|errors]].
This project includes:
 * Syntax support for boolean (declare a boolean type in an AS clause, cast to boolean, support
boolean constants)
 * Backend support (all expressions can handle boolean, shuffle keys can be boolean)

=== Heuristics for default parallel ===
In a Pig script, the user can specify the number of reducers in several ways (PARALLEL clause, default_parallel,
the "mapred.reduce.tasks" property). If the user doesn't specify the number of reducers in the script, Pig
uses a [[https://issues.apache.org/jira/browse/PIG-1249|simple heuristic]] to calculate
it. However, this algorithm can be improved. We can use information such as
the nature of the operator and the individual sizes of the inputs to get a better estimate.
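
As a starting point for discussion, here is a rough Python sketch of a size-based reducer estimate. It is not the Pig implementation. The defaults (about 1 GB of input per reducer, capped at 999 reducers) reflect our understanding of PIG-1249 and should be checked against the issue, and the `selectivity` parameter is a purely hypothetical knob showing where operator-specific knowledge could enter.

```python
def estimate_reducers(input_sizes, bytes_per_reducer=1_000_000_000,
                      max_reducers=999, selectivity=1.0):
    """Estimate a reducer count from the input sizes (in bytes).

    selectivity is hypothetical (not a Pig setting): the fraction of input
    bytes expected to reach the reducers, e.g. below 1.0 after a filter.
    """
    effective_bytes = int(sum(input_sizes) * selectivity)
    needed = -(-effective_bytes // bytes_per_reducer)  # ceiling division
    # Always use at least one reducer, and never exceed the cap.
    return max(1, min(max_reducers, needed))

small = estimate_reducers([2_500_000_000])
capped = estimate_reducers([10**13])
filtered = estimate_reducers([2_500_000_000, 500_000_000], selectivity=0.1)
```

A better heuristic could replace the single `selectivity` factor with per-operator and per-input estimates, which is the direction this project proposes.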

=== New Join Type: indexed join ===
Pig has a [[http://wiki.apache.org/pig/JoinFramework|join framework]] and currently supports
[[http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#JOIN+%28inner%29|normal, replicated,
skewed and merge joins]]. If one of the inputs is indexed (e.g., an HBase table), we can use this
fact to do a map-side join.
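
The idea can be sketched in a few lines of Python: stream one input through the mapper and look up each join key in the index of the other input, so no shuffle is needed. This is an illustration under simplifying assumptions, not a design for Pig. The function name is hypothetical, a plain dict stands in for an indexed store such as an HBase table, and each key is assumed to match at most one indexed row.

```python
def indexed_join(stream, lookup, key_of):
    """Inner-join each streamed row against an index, one lookup per row."""
    for row in stream:
        match = lookup(key_of(row))
        if match is not None:      # drop rows with no match (inner join)
            yield row + match

# Hypothetical data: clicks are streamed, users are reachable through an index.
clicks = [("u1", "/home"), ("u2", "/cart"), ("u9", "/404")]
user_index = {"u1": ("Alice",), "u2": ("Bob",)}  # stands in for an indexed table

joined = list(indexed_join(clicks, user_index.get, key_of=lambda row: row[0]))
```

Because every lookup is independent, each map task can process its split of the streamed input without seeing the rest of the data.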

== Getting Started ==
First, you need to learn the Pig Latin language. The best sources for learning Pig Latin are:
 * [[http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html|Pig Latin Reference]]
 * [[http://infolab.stanford.edu/~olston/publications/sigmod08.pdf|Pig Latin paper at SIGMOD 2008]]
 * [[http://wiki.apache.org/pig/PigTutorial|Pig Tutorial]]

Be sure to sign up for the [[http://pig.apache.org/mailing_lists.html|Pig mailing lists]].

Then check out the Pig source code using svn:
svn co http://svn.apache.org/repos/asf/pig/trunk

Set up environment for Eclipse:

Learn more about Pig internals in the [[http://infolab.stanford.edu/~olston/publications/vldb09.pdf|Pig
paper at VLDB 2009]].

Browse through the Pig code. Some good starting points are:
 * [[http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/parser/QueryLexer.g|QueryLexer.g]]:
Pig parser, LogicalPlan construction
 * [[http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java|LogToPhyTranslationVisitor]]:
From logical plan to physical plan
 * [[http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java|MRCompiler]]:
From physical plan to map-reduce plan
 * [[http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java|JobControlCompiler]]:
From map-reduce plan to Hadoop job
 * [[http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java|MapReduceLauncher]]:
Hadoop launcher
 * [[http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigMapBase.java|PigMapBase]]:
map class for Pig
 * [[http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigMapReduce.java|PigMapReduce]]:
reduce class for Pig

== How to Apply ==
 * Follow the [[http://code.google.com/soc/|GSoC]] instructions to apply. Please apply to the Apache
Software Foundation.
 * Keep the [[http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/faqs#timeline|timeline]]
in mind.
 * It is highly recommended that you discuss your interest before you apply. The best way to do so
is to comment on the individual Jira issue or send mail to the dev list.
