hadoop-pig-dev mailing list archives

From "pi song" <pi.so...@gmail.com>
Subject Re: More streamlined schema definition syntax ?
Date Wed, 28 May 2008 07:32:15 GMT
1) If we want to allow bags of anything in the future, note that there are
already statements like this:

A = FOREACH B GENERATE B.$0, B.$1;

which already accesses tuple content directly. I'm not sure whether there are
others.

2) On the Perl point, I think optionally omitting the bag keyword is not a
problem. We already have different brackets meaning different things: ( ),
{ }, [ ]. That's why I think forcing TUPLE, BAG, and MAP is redundant.
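
To illustrate point 2, here is a minimal Python sketch of how a parser could
tell the three types apart from the brackets alone. It is not Pig's actual
parser: the mini-grammar, the treatment of [ ] as an untyped map, and every
name in it (tokenize, Parser, parse_schema) are assumptions made up for this
example. It also assumes a bag holds only tuples, which is exactly the
assumption Alan points out we might drop later.

```python
import re

# Hypothetical mini-grammar, NOT Pig's real parser:
#   ( ... ) is a tuple, { ... } is a bag, [ ] is a map; keywords are optional.
TOKEN = re.compile(r"\s*([A-Za-z_]\w*|[(){}\[\]:,])")
KEYWORD_BRACKET = {"tuple": "(", "bag": "{", "map": "["}


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, text):
        self.toks = tokenize(text)
        self.i = 0

    def peek(self, offset=0):
        j = self.i + offset
        return self.toks[j] if j < len(self.toks) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def parse_type(self):
        tok = self.peek()
        # A keyword directly followed by its own bracket is redundant: skip it.
        if tok is not None and KEYWORD_BRACKET.get(tok.lower()) == self.peek(1):
            self.take()
            tok = self.peek()
        if tok == "(":
            return ("tuple", self.parse_fields("(", ")"))
        if tok == "{":
            return ("bag", self.parse_bag_body())
        if tok == "[":
            self.take("[")
            self.take("]")
            return ("map",)
        name = self.take()
        if not (name[0].isalpha() or name[0] == "_"):
            raise ValueError(f"expected a type name, got {name!r}")
        return name.lower()  # simple type such as int or long

    def parse_fields(self, open_tok, close_tok):
        self.take(open_tok)
        fields = []
        while True:
            name = self.take()
            self.take(":")
            fields.append((name, self.parse_type()))
            if self.peek() != ",":
                break
            self.take(",")
        self.take(close_tok)
        return fields

    def parse_bag_body(self):
        # Long form {name: tuple(...)} is detected by the explicit keyword.
        # Assumption: a bag holds only tuples, so the short form {a: int}
        # means a bag of one-field tuples, never a bag of ints.
        if self.peek(2) == ":" and (self.peek(3) or "").lower() == "tuple":
            self.take("{")
            self.take()  # tuple alias, ignored in this sketch
            self.take(":")
            inner = self.parse_type()
            self.take("}")
            if not isinstance(inner, tuple) or inner[0] != "tuple":
                raise ValueError("expected tuple(...) inside bag")
            return inner[1]
        return self.parse_fields("{", "}")


def parse_schema(text):
    parser = Parser(text)
    schema = parser.parse_type()
    if parser.peek() is not None:
        raise ValueError(f"trailing input: {parser.peek()!r}")
    return schema
```

Under these assumptions, the long form from the PigType branch and the
shortest form from my earlier mail parse to the same structure. For example,
parse_schema("(fieldA: {a:int,b:long}, fieldB: Int)") gives the same result
as the version written with bag {tuple1:tuple(...)}.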

Pi

On 5/28/08, Alan Gates <gates@yahoo-inc.com> wrote:
>
> A couple of thoughts:
>
> The issue with removing the tuple keyword from the bag definition, so we can
> have bag: {a: int} instead of bag: {tuple: (a: int)}, is that we had discussed
> allowing bags to be bags of anything, instead of bags of tuples.  We aren't
> doing anything about that now, but we might in the future.  We would have to
> change the semantics of bag type declarations if we made that change.
> Otherwise we would not know whether bag {a: int} meant a bag of one-element
> tuples or a bag of ints.
>
> As for letting {} alone mean bag, I'm concerned Pig Latin will end up like
> Perl, where different brackets mean different things and it's hard to read
> the code.  The other extreme is ending up like SQL, where it takes way too
> many keywords to do something.  I'm open to others' views on this.
>
> Alan.
>
> pi song wrote:
>
>> Here is what I know:-
>>
>> Tuple Schema = schema associated with "a" tuple
>> Bag Schema = schema of all tuples contained in a bag
>>
>> Then, here is the current way to specify a schema in the PigType branch:-
>>
>> A = LOAD 'file1' AS (fieldA: bag {tuple1:tuple(a:int,b:long,c:float,d:double)}, fieldB: Int)
>>
>> Isn't this inefficient? Since we have already agreed that a bag only
>> contains tuples, not arbitrary datums, I think it would be better if users
>> could just write:-
>>
>> A = LOAD 'file1' AS (fieldA: bag {a:int,b:long,c:float,d:double}, fieldB: Int)
>>
>> Or even better, due to the fact that the curly braces already indicate Bag
>> data type:-
>>
>> A = LOAD 'file1' AS (fieldA: {a:int,b:long,c:float,d:double}, fieldB: Int)
>>
>> So potentially I think the keyword "Bag" should be optional for
>> convenience. This is the same as when we specify a tuple schema, which is
>> already indicated by round brackets.
>>
>> Any opinion? It's now time to make it easy for users.
>>
>> Pi
>>
>> PS. I'm willing to make the change if everybody is too busy.
>>
>>
>>
>
