hadoop-common-user mailing list archives

From: Utkarsh Srivastava <utka...@yahoo-inc.com>
Subject: Re: looking for some help with pig syntax
Date: Wed, 29 Aug 2007 05:01:09 GMT

Hi,

Two of Pig's data types matter here:

i) Tuple: a collection of fields, like a database record.
ii) Bag: a collection of tuples, like a database table.

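In
> t1 = load table1 as id, listOfId;

if listOfId is a bag, for example {2, 3, 4} in a row whose id is 1,
then generating id together with flatten(listOfId) gives one tuple per
element of the bag: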
<1, 2>
<1, 3>
<1, 4>

If listOfId is a tuple, flattening only removes the tuple wrapping, and
you get a single tuple:
<1, 2, 3, 4>
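
To make the difference concrete, here is a small Python model of the two cases. It is only a sketch of the semantics described above, not Pig code: the function name flatten and the choice to model a bag as a Python list and a tuple as a Python tuple are assumptions made for illustration, and bag elements are written as bare values rather than one-field tuples.

```python
def flatten(prefix, value):
    """Model of flattening `value` next to the fields in `prefix`.

    Assumption for this sketch: a Pig bag is modelled as a Python list
    and a Pig tuple as a Python tuple.
    """
    if isinstance(value, list):    # bag: one output tuple per element
        return [prefix + (item,) for item in value]
    if isinstance(value, tuple):   # tuple: only the wrapping is removed
        return [prefix + value]
    return [prefix + (value,)]     # atom: nothing to flatten


print(flatten((1,), [2, 3, 4]))    # bag case:   [(1, 2), (1, 3), (1, 4)]
print(flatten((1,), (2, 3, 4)))    # tuple case: [(1, 2, 3, 4)]
```

The bag case multiplies rows (a cross product of the other fields with the bag's contents), while the tuple case only widens the single row.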

Assuming that listOfId is a bag, the following Pig script is what you
want (the line after each statement shows its output for the sample
data):

t1 = load table1 as id, listOfId;
-- <1, {2, 3, 4}>
t2 = load table2 as joinId, f1;
-- <2, a> <3, b> <4, c>
t3 = foreach t1 generate id, flatten(listOfId);
-- <1, 2> <1, 3> <1, 4>
t4 = join t3 by $1, t2 by joinId;
-- <1, 2, 2, a> <1, 3, 3, b> <1, 4, 4, c>
t5 = foreach t4 generate id, f1;
-- <1, a> <1, b> <1, c>
t6 = group t5 by id;
-- <1, {<1, a>, <1, b>, <1, c>}>

t6 contains your result.
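
For anyone who wants to check the data flow without a Pig installation, the same steps can be traced in plain Python. This is a sketch under stated assumptions, not Pig code: bags are modelled as Python lists, the sample rows are the ones used in this thread, and the final projection into result is an extra step added here to reduce each group to its f1 values.

```python
from collections import defaultdict

# Sample data from the thread; a bag is modelled as a Python list.
table1 = [(1, [2, 3, 4])]                  # id, listOfId
table2 = [(2, "a"), (3, "b"), (4, "c")]    # joinId, f1

# t3: flatten the bag, one (id, element) pair per bag element
t3 = [(id_, elem) for id_, ids in table1 for elem in ids]

# t4: equi-join t3's second field with table2's joinId
t4 = [(id_, elem, join_id, f1)
      for id_, elem in t3
      for join_id, f1 in table2
      if elem == join_id]

# t5: project id and f1
t5 = [(id_, f1) for id_, _, _, f1 in t4]

# t6: group by id; each group keeps the whole (id, f1) tuples
t6 = defaultdict(list)
for row in t5:
    t6[row[0]].append(row)

# Extra step (an assumption of this sketch): keep only f1 in each group
result = {id_: [f1 for _, f1 in rows] for id_, rows in t6.items()}
print(result)  # {1: ['a', 'b', 'c']}
```

The nested loop in t4 is a naive join that is fine for a handful of rows; it is here only to show which fields are compared, not how a join would be executed at scale.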

Utkarsh



On Aug 28, 2007, at 5:58 PM, Joydeep Sen Sarma wrote:

>
> I am misunderstanding something.
>
> Following the intro to Pig Latin doc (p. 6), the flatten generating 'a'
> would generate <1,2,3,4> (and not <1,2>,<1,3>,<1,4>).
>
>
> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com]
> Sent: Tuesday, August 28, 2007 12:47 PM
> To: hadoop-user@lucene.apache.org
> Cc: utkarsh@yahoo-inc.com
> Subject: Re: looking for some help with pig syntax
>
> Sorry, I misunderstood what you were trying to generate.  Perhaps the
> following will come closer:
>
> t1 = load table1 as id, listOfId; -- <1, <2,3,4>>
> t2 = load table2 as id, f1; -- <2,a>,<3,b>,<4,c>
> a = foreach t1 generate id, flatten(listOfId); -- <1,2>,<1,3>,<1,4>
> b = join a by $1, t2 by id; -- <2,1,2,2,a>,<3,1,3,3,b>,<4,1,4,4,c>
> c = group b by $1; -- <1,{<2,1,2,2,a>,<3,1,3,3,b>,<4,1,4,4,c>}>
> d = foreach c generate group, c.b::$4; -- <1, {<a>,<b>,<c>}>
>
> where <> represents a tuple and {} a bag.
>
> I'm not 100% sure of the syntax c.b::$4 for d, you may have to fiddle
> with that to get it right.
>
> Alan.
>
>
>
>
> Joydeep Sen Sarma wrote:
>> Will it?
>>
>> Trying an example:
>>
>> t1 = {<1, <2, 3, 4>>}
>> t2 = {<2, "alpha">,<3,"beta">,<4,"gamma">}
>>
>> desired outcome c = {<1, <"alpha", "beta", "gamma">>} /* or alternatively */
>>                 c = {<1, <<2,"alpha">,<3,"beta">,<4,"gamma">>>}
>>
>> but as proposed (I hope I am reading the pig document correctly):
>>
>> t1a = {<2,3,4>}
>> b = {<2, 2, "alpha">}
>>
>> // no point going further - this doesn't seem to be doing what I want ..
>>
>>
>> -----Original Message-----
>> From: Alan Gates [mailto:gates@yahoo-inc.com]
>> Sent: Tuesday, August 28, 2007 10:45 AM
>> To: hadoop-user@lucene.apache.org
>> Cc: utkarsh@yahoo-inc.com
>> Subject: Re: looking for some help with pig syntax
>>
>> I think the following will do what you want.
>>
>> t1 = load table1 as id, listOfId;
>> t2 = load table2 as id, f1;
>> t1a = foreach t1 generate flatten(listOfId); -- flattens the listOfId into a set of ids
>> b = join t1a by $0, t2 by id; -- join the two together.
>> c = foreach b generate t2.id, t2.f1; -- project just the ids and f1 entries.
>>
>> Alan.
>>
>> Joydeep Sen Sarma wrote:
>>
>>> Specifically, how can we express this query:
>>>
>>>
>>>
>>> Table1 contains: id, (list of ids)
>>>
>>> Table2 contains: id, f1
>>>
>>>
>>>
>>> Where the Table1:list is a variable-length list of foreign keys (id)
>>> into Table2.
>>>
>>>
>>>
>>> We would like to join every element of Table1:list with the
>>> corresponding Table2:id, i.e. the final output should be of the form:
>>>
>>>
>>>
>>> Table3 contains: id, (list of f1)
>>>
>>>
>>>
>>> Couldn't quite figure out how to do this - does Pig Latin support
>>> nested foreach loops? If there's a more appropriate mailing list -
>>> please re-direct.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> Joydeep

