hadoop-pig-dev mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject What is a relation?
Date Sat, 06 Dec 2008 02:04:27 GMT
All,

A question on types in pig.  When you say:

A = load 'myfile';

what exactly is A?  For the moment let us call A a relation, since it  
is a set of records, and we can pass it to a relational operator,  
such as FILTER, ORDER, etc.

To clarify the question, is a relation equivalent to a bag?  In some  
ways it seems to be in our current semantics.  Certainly you can turn  
a relation into a bag:

A = load 'myfile';
B = group A all;

The schema of the relation B at this point is <group, A>, where A is  
a bag.  This does not necessarily mean that a relation is a bag,  
because an operation had to occur to turn the relation into a bag  
(the group all).
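
As a rough illustration, describe can be used to inspect the schema at  
this point.  This is only a sketch: it assumes 'myfile' exists and is  
readable by the default loader, and the exact output depends on the Pig  
version and on whether A was loaded with a schema, so none is shown.

```pig
A = load 'myfile';
B = group A all;
-- Print the schema of B; per the description above it has two fields,
-- group and the bag A.
describe B;
```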

But bags can be turned into relations, and then treated again as if  
they were bags:

C = foreach B {
        C1 = filter A by $0 > 0;
        generate COUNT(C1);
}

Here the bag A created in the previous grouping step is being treated  
as if it were a relation and passed to a relational operator, and the  
resulting relation (C1) is treated as a bag and passed to COUNT.  So at  
a very minimum it seems that a bag is a type of relation, even if not  
all relations are bags.

But, if top level (non-nested) relations are bags, why isn't it legal  
to do:

A = load 'myfile';
B = A.$0;

The second statement would be legal nested inside a foreach, but is  
not legal at the top level.
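
For contrast, here is a minimal sketch of the two forms that are  
accepted today: the projection written with foreach at the top level,  
and the dotted projection nested inside a foreach.  The aliases P, C and  
C1 are made up for the illustration.

```pig
A = load 'myfile';

-- Top level: projecting the first field has to go through foreach.
P = foreach A generate $0;

-- Nested: the same projection written as A.$0 on the bag A.
B = group A all;
C = foreach B {
        C1 = A.$0;
        generate COUNT(C1);
}
```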

We have been aware of this discrepancy for a while, and lived with  
it.  But I believe it is time to resolve it.  We've noticed that some  
parts of pig assume an equivalence between bag and relation (e.g. the  
typechecker) and other parts do not (e.g. the syntax example above).   
This inconsistency is confusing to users and developers alike.  As  
Pig Latin matures we need to strive to make it a logically coherent  
and complete language.

So, thoughts on how it ought to be?

The advantage I see for saying a relation is equivalent to a bag is  
simplicity of the language.  There is no need to introduce another  
data type.  And it allows full relational operations to occur at both  
the top level and nested inside foreach.

But this simplicity also seems to me to be the downside.  Are we  
decoupling the user so far from the underlying implementation that he  
will not be able to see the side effects of his actions?  A top level  
relation is presumably spread across many chunks, and any operation on  
it will require one or more map reduce jobs, whereas a relation nested  
in a foreach is contained on one node.  This also makes pig much more  
complex, because while it may hide this level of detail from the  
user, it clearly has to understand the difference between top level  
and nested operations and handle both cases.
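
To make the concern concrete, here is a sketch of the same filter  
written at both levels.  It assumes the records in 'myfile' have at  
least two fields; grouping by $1 and the aliases T and N are  
illustrative only.

```pig
A = load 'myfile';

-- Top level: the filter applies to all of A, however it is distributed.
T = filter A by $0 > 0;

-- Nested: the same filter applied to the bag A of one group at a time.
B = group A by $1;
C = foreach B {
        N = filter A by $0 > 0;
        generate group, COUNT(N);
}
```

The two filters read identically, which is exactly what would make the  
difference in cost hard for a user to see.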

Alan.
