hadoop-pig-dev mailing list archives

From "pi song" <pi.so...@gmail.com>
Subject Re: What is a relation?
Date Sat, 06 Dec 2008 07:39:10 GMT
Here is an example that I gave a while ago in JIRA PIG-158:

A = LOAD 'fil1' ;
B = A.($0,$1) ;
STORE B ;

which is similar to your top-level projection example.
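
For comparison, here is a minimal sketch of how the same projection can be
written at the top level with FOREACH ... GENERATE, which is already legal.
The output path 'out1' is a made-up placeholder, not something from the JIRA.

```
-- load the input as in the example above
A = LOAD 'fil1';
-- project the first two fields of every record
B = FOREACH A GENERATE $0, $1;
-- 'out1' is a hypothetical output location
STORE B INTO 'out1';
```

If relations and bags were treated as the same thing, A.($0,$1) would simply
be shorthand for this statement.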

I believe there is no distinction between so-called relations and bags in
our context.

"A top level relation is assumably spread across many chunks and any
operation on it will require one or more map reduce jobs, whereas a relation
nested in a foreach is contained on one node." <== As I proposed before,
whether an operation runs across many nodes should have nothing to do with
whether it is top-level or nested. The deciding factor should instead be the
"job size", which is calculated heuristically.

To give users some control over whether an operation runs across nodes, we
may later introduce a hint keyword. This keeps the language simple, yet
powerful when needed.

Pi


On Sat, Dec 6, 2008 at 1:04 PM, Alan Gates <gates@yahoo-inc.com> wrote:

> All,
>
> A question on types in pig.  When you say:
>
> A = load 'myfile';
>
> what exactly is A?  For the moment let us call A a relation, since it is a
> set of records, and we can pass it to a relational operator, such as FILTER,
> ORDER, etc.
>
> To clarify the question, is a relation equivalent to a bag?  In some ways
> it seems to be in our current semantics.  Certainly you can turn a relation
> into a bag:
>
> A = load 'myfile';
> B = group A all;
>
> The schema of the relation B at this point is <group, A>, where A is a bag.
>  This does not necessarily mean that a relation is a bag, because an
> operation had to occur to turn the relation into a bag (the group all).
>
> But bags can be turned into relations, and then treated again as if they
> were bags:
>
> C = foreach B {
>       C1 = filter A by $0 > 0;
>       generate COUNT(C1);
> }
>
> Here the bag A created in the previous grouping step is being treated as if
> it were a relation and passed to a relational operator, and the resulting
> relation (C1) treated as a bag to be passed COUNT.  So at a very minimum it
> seems that a bag is a type of a relation, even if not all relations are
> bags.
>
> But, if top level (non-nested) relations are bags, why isn't it legal to
> do:
>
> A = load 'myfile';
> B = A.$0;
>
> The second statement would be legal nested inside a foreach, but is not
> legal at the top level.
>
> We have been aware of this discrepancy for a while, and lived with it.  But
> I believe it is time to resolve it.  We've noticed that some parts of pig
> assume an equivalence between bag and relation (e.g. the typechecker) and
> other parts do not (e.g. the syntax example above).  This inconsistency is
> confusing to users and developers alike.  As Pig Latin matures we need to
> strive to make it a logically coherent and complete language.
>
> So, thoughts on how it ought to be?
>
> The advantage I see for saying a relation is equivalent to a bag is
> simplicity of the language.  There is no need to introduce another data
> type.  And it allows full relational operations to occur at both the top
> level and nested inside foreach.
>
> But this simplicity also seems to me the downside.  Are we decoupling the user
> so far from the underlying implementation that he will not be able to see
> side effects of his actions?  A top level relation is assumably spread
> across many chunks and any operation on it will require one or more map
> reduce jobs, whereas a relation nested in a foreach is contained on one
> node.   This also makes pig much more complex, because while it may hide
> this level of detail from the user, it clearly has to understand the
> difference between top level and nested operations and handle both cases.
>
> Alan.
>
