hadoop-pig-dev mailing list archives

From "Santhosh Srinivasan" <...@yahoo-inc.com>
Subject RE: What is a relation?
Date Thu, 11 Dec 2008 20:15:13 GMT
In the existing implementation, Pig has the following dilemma.

1. A subset of the relational operators is allowed inside a foreach
E.g: filter, distinct, order by

2. Non-relational operators that are allowed inside a foreach are not
allowed outside a foreach
E.g: Projections: A = B.$0;, Assignments for scalars: X = COUNT(D);,
etc.

Let's assume that a relation is a bag (and vice versa too).

Scenario 1:
-----------
In the future if there are plans to allow all operators that exist
inside a foreach outside and vice versa, we will have the following
problem:

A = load 'input';
B = COUNT(A);
C = group A by $0;
D = foreach C { X = COUNT(A); generate X; };

Is B a relation? 
Yes - is B a bag that contains tuples of longs?
No - is B a scalar of type long?
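
For reference, a minimal sketch of how a top-level count is usually
written today, with an explicit group (the alias G is only
illustrative and not part of the example above):

A = load 'input';
G = group A all;                  -- a single group holding every tuple of A
B = foreach G generate COUNT(A);  -- for non-empty input: one tuple with one long

Written this way, B is unambiguously a relation, which matches the
"Yes" reading. The scalar reading would need new top-level semantics.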

Scenario 2:
-----------

If there are no plans to allow the operators that exist inside a foreach
outside it (or vice versa), it makes good sense to treat relations as
bags and vice versa, but there are some open questions:

1. Do storage functions indicate that the stored data is a bag?
Likewise, do load functions treat the stored data as bags?
2. Will there be an equivalence of the operators wherein bags can
replace relations in all operators that support relational operator
inputs?
E.g: Pradeep alluded to the use of a bag column inside a relation with
other relational operators.
A = load 'input' as (x: int, b: {t:(a: int)});
B = filter A.b by a > 10;
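
For comparison, a sketch of the form that is legal today, where the bag
column is filtered inside a nested foreach instead (the alias big is
only illustrative):

A = load 'input' as (x: int, b: {t:(a: int)});
B = foreach A {
        big = filter b by a > 10;  -- relational operator applied to the bag b
        generate x, big;
};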

Conclusion
-----------

The equivalence of bags and relations is influenced by the long term
plan of what will be legal in the language and not necessarily
influenced by the current state of the language. In the short term, it
makes sense to treat relations as bags (and vice versa in some cases).
In the long term, a relation should be treated as its own type, with the
legal operations defined on that type.

Santhosh 

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Thursday, December 11, 2008 12:10 PM
To: pig-dev@hadoop.apache.org
Subject: Re: What is a relation?

All of what you say sounds like a feature to me rather than a problem.

Yes, the implementor needs to do it right, but that kind of goes with
the territory.

On Thu, Dec 11, 2008 at 11:32 AM, Pradeep Kamath
<pradeepk@yahoo-inc.com> wrote:

> I find it somewhat inconsistent that we treat both relations and bags
> the same.
>
> SIZE(A) where A is a real bag will be different in implementation than
> SIZE(A) where A is a relation. For the former, all the data is already
> in a container and one can just inspect the size. For the latter, you
> have to do a group ALL-COUNT. This would be very confusing from a
> backend implementation point of view.
>
> If we do treat relations and bags as equivalent, then all statements
> which currently work on relations should work on bags (say in my input
> data). Here is an example:
> A = load 'bla' as (bg:{t:(x:int, y, z)}, str:chararray);
> B = filter A.bg by x < 100; -- Directly access the bag "bg" inside A
> (which is supposed to be a bag too) and filter on it. Likewise, other
> operations possible on relations should work.
>
> Also A = load 'bla'; B = COUNT(A); will have to be supported (implicitly
> by a map reduce boundary doing a group ALL-COUNT). This will be done
> under the covers and it may not be obvious to a user that an explicit
> group ALL-COUNT and a direct COUNT(A) are the same.
>
>
> Thanks,
> Pradeep
>
> -----Original Message-----
> From: Olga Natkovich [mailto:olgan@yahoo-inc.com]
> Sent: Thursday, December 11, 2008 11:12 AM
> To: pig-dev@hadoop.apache.org
> Subject: RE: What is a relation?
>
> I think we should consider bags and relations to be the same so that we
> can handle processing in the outer script and inside a nested foreach
> the same way, and make it easier to extend the set of operators
> allowed inside a foreach block.
>
> Olga
>
> > -----Original Message-----
> > From: Alan Gates [mailto:gates@yahoo-inc.com]
> > Sent: Friday, December 05, 2008 6:04 PM
> > To: pig-dev@hadoop.apache.org
> > Subject: What is a relation?
> >
> > All,
> >
> > A question on types in pig.  When you say:
> >
> > A = load 'myfile';
> >
> > what exactly is A?  For the moment let us call A a relation,
> > since it is a set of records, and we can pass it to a
> > relational operator, such as FILTER, ORDER, etc.
> >
> > To clarify the question, is a relation equivalent to a bag?
> > In some ways it seems to be in our current semantics.
> > Certainly you can turn a relation into a bag:
> >
> > A = load 'myfile';
> > B = group A all;
> >
> > The schema of the relation B at this point is <group, A>,
> > where A is a bag.  This does not necessarily mean that a
> > relation is a bag, because an operation had to occur to turn
> > the relation into a bag (the group all).
> >
> > But bags can be turned into relations, and then treated again
> > as if they were bags:
> >
> > C = foreach B {
> >         C1 = filter A by $0 > 0;
> >         generate COUNT(C1);
> > }
> >
> > Here the bag A created in the previous grouping step is being
> > treated as if it were a relation and passed to a relational
> > operator, and the resulting relation (C1) is treated as a bag
> > and passed to COUNT.  So at a very minimum it seems that a bag
> > is a type of relation, even if not all relations are bags.
> >
> > But, if top level (non-nested) relations are bags, why isn't
> > it legal to do:
> >
> > A = load 'myfile';
> > B = A.$0;
> >
> > The second statement would be legal nested inside a foreach,
> > but is not legal at the top level.
> >
> > We have been aware of this discrepancy for a while, and lived
> > with it.  But I believe it is time to resolve it.  We've
> > noticed that some parts of pig assume an equivalence between
> > bag and relation (e.g. the typechecker) and other parts do
> > not (e.g. the syntax example above).
> > This inconsistency is confusing to users and developers
> > alike.  As Pig Latin matures we need to strive to make it a
> > logically coherent and complete language.
> >
> > So, thoughts on how it ought to be?
> >
> > The advantage I see for saying a relation is equivalent to a
> > bag is simplicity of the language.  There is no need to
> > introduce another data type.  And it allows full relational
> > operations to occur at both the top level and nested inside foreach.
> >
> > But this simplicity also seems to me the downside.  Are we
> > decoupling the user so far from the underlying implementation
> > that he will not be able to see side effects of his actions?
> > A top level relation is presumably spread across many chunks
> > and any operation on it will require one or more map reduce
> > jobs, whereas a relation nested in a foreach is contained on
> > one node.  This also makes pig much more
> > complex, because while it may hide this level of detail from
> > the user, it clearly has to understand the difference between
> > top level and nested operations and handle both cases.
> >
> > Alan.
> >
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
