hadoop-common-user mailing list archives

From "Joydeep Sen Sarma" <jssa...@facebook.com>
Subject RE: looking for some help with pig syntax
Date Wed, 29 Aug 2007 19:15:22 GMT
To echo these concerns: unless there is a way of converting or
re-interpreting the same underlying on-disk data, the data structure has
to be designed keeping in mind every pig operation that may ever be
applied to it. That's obviously really bad. Considering the space/time
overhead, re-interpretation would be better than conversion (which would
suggest that you might need two variants of the flatten operator).

 

(From a programmer's perspective, there's little difference between a
bag and a tuple. They are both lists. The fact that they are not treated
symmetrically makes things less elegant imho.)

 

________________________________

From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Wednesday, August 29, 2007 11:03 AM
To: hadoop-user@lucene.apache.org; Joydeep Sen Sarma
Cc: hadoop-user@lucene.apache.org
Subject: RE: looking for some help with pig syntax

 

 

How do you cause data to be interpreted one way or the other?

How do you convert from one representation to another?

What is the motive for treating tuples and bags differently?

Is there some special treatment of a bag of single element tuples?


-----Original Message-----
From: Utkarsh Srivastava [mailto:utkarsh@yahoo-inc.com]
Sent: Tue 8/28/2007 10:01 PM
To: Joydeep Sen Sarma
Cc: hadoop-user@lucene.apache.org
Subject: Re: looking for some help with pig syntax

Hi,

There are 2 different data types in Pig:

i) Tuple: a collection of fields, like a database record
ii) Bag: collection of tuples, like a database table.

In,
> t1 = load table1 as id, listOfId;

If listOfId is a bag, flattening will give you
<1, 2>
<1, 3>
<1, 4>

If listOfId is a tuple, flattening will only remove the tuple 
wrapping and you will get
< 1, 2, 3, 4>
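
The two flatten behaviours described above can be modelled outside Pig. The following is a minimal Python sketch, not Pig code: the `Bag` class and the `flatten_field` helper are hypothetical names invented for this illustration, and the sketch only mirrors the bag-versus-tuple distinction stated in this message.

```python
class Bag(list):
    """Hypothetical stand-in for a Pig bag: a collection of tuples."""


def flatten_field(prefix, field):
    """Model flatten() applied to one field that follows `prefix`.

    Returns the list of output rows produced for a single input row.
    """
    if isinstance(field, Bag):
        # A bag yields one output row per tuple it contains.
        return [prefix + tuple(t) for t in field]
    if isinstance(field, tuple):
        # A tuple only loses its wrapping: still a single output row.
        return [prefix + field]
    # An atomic field is passed through unchanged.
    return [prefix + (field,)]


as_bag = flatten_field((1,), Bag([(2,), (3,), (4,)]))
as_tuple = flatten_field((1,), (2, 3, 4))

print(as_bag)    # [(1, 2), (1, 3), (1, 4)]
print(as_tuple)  # [(1, 2, 3, 4)]
```

The same three ids give three rows when held in a bag and one wide row when held in a tuple, which is the difference between the two outputs quoted above.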

Assuming that listOfId is a bag, the following pig script is what you 
want

t1 = load table1 as id, listOfId;
<1, {2,3,4}>
t2 = load table2 as joinId, f1;
<2, a> < 3, b> <4, c>
t3 = foreach t1 generate id, flatten(listOfId);
<1, 2> <1, 3> <1, 4>
t4 = join t3 by $1, t2 by joinId;
< 1, 2, 2, a> < 1, 3, 3, b> <1, 4, 4, c>
t5 = foreach t4 generate id, f1;
<1,a> <1, b> <1, c>
t6 = group t5 by id;
<1, {a, b, c}>

t6 contains your result.
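
For readers who want to trace the script step by step, here is a rough Python sketch of the same pipeline over the sample rows shown above. It is only a model of the data flow, under the assumption that listOfId is a bag; the variable names follow the script, the list-of-tuples representation is an assumption made for illustration, and the final grouping is shown in the simplified <1, {a, b, c}> form used in the annotations.

```python
from collections import defaultdict

# Sample rows from the annotations above; a bag is modelled as a list of tuples.
table1 = [(1, [(2,), (3,), (4,)])]            # <1, {2,3,4}>
table2 = [(2, "a"), (3, "b"), (4, "c")]       # <2, a> <3, b> <4, c>

# t3: flatten the bag, giving one (id, foreign key) row per bag element.
t3 = [(id_, fk) for id_, bag in table1 for (fk,) in bag]

# t4: equi-join t3.$1 with t2.joinId, using a hash index on table2.
index = defaultdict(list)
for join_id, f1 in table2:
    index[join_id].append((join_id, f1))
t4 = [left + right for left in t3 for right in index.get(left[1], [])]

# t5: project id (position 0) and f1 (position 3).
t5 = [(row[0], row[3]) for row in t4]

# t6: group by id, collecting the f1 values.
groups = defaultdict(list)
for id_, f1 in t5:
    groups[id_].append(f1)
t6 = sorted(groups.items())

print(t6)  # [(1, ['a', 'b', 'c'])]
```

Bags are unordered, so the a, b, c ordering here is an artifact of using Python lists and should not be relied on.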

Utkarsh



On Aug 28, 2007, at 5:58 PM, Joydeep Sen Sarma wrote:

>
> I am misunderstanding something.
>
> following intro to pig-latin doc (p6), the flatten generating 'a' would
> generate <1,2,3,4> (and not <1,2>,<1,3>,<1,4>)
>
>
> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com]
> Sent: Tuesday, August 28, 2007 12:47 PM
> To: hadoop-user@lucene.apache.org
> Cc: utkarsh@yahoo-inc.com
> Subject: Re: looking for some help with pig syntax
>
> Sorry, I misunderstood what you were trying to generate.  Perhaps the
> following will come closer:
>
> t1 = load table1 as id, listOfId; -- <1, <2,3,4>>
> t2 = load table2 as id, f1; -- <2,a>,<3,b>,<4,c>
> a = foreach t1 generate id, flatten(listOfId); -- <1,2>,<1,3>,<1,4>
> b = join a by $0, t2 by id; -- <2,1,2,2,a>,<3,1,3,3,b>,<4,1,4,4,c>
> c = group b by $1; -- <1,{<2,1,2,2,a>,<3,1,3,3,b>,<4,1,4,4,c>}>
> d = foreach c generate group, c.b::$4; -- <1, {<a>,<b>,<c>}>
>
> where <> represents a tuple and {} a bag.
>
> I'm not 100% sure of the syntax c.b::$4 for d, you may have to fiddle
> with that to get it right.
>
> Alan.
>
>
>
>
> Joydeep Sen Sarma wrote:
>> Will it?
>>
>> Trying an example:
>>
>> t1 = {<1, <2, 3, 4>>}
>> t2 = {<2, "alpha">,<3,"beta">,<4,"gamma">}
>>
>> desired outcome c = {<1, <"alpha", "beta", "gamma">>} /* or alternatively */
>>                 c = {<1, <<2,"alpha">,<3,"beta">,<4,"gamma">>>}
>>
>> but as proposed (I hope I am reading the pig document correctly):
>>
>> t1a = {<2,3,4>}
>> b = {<2, 2, "alpha">}
>>
>> // no point going further - this doesn't seem to be doing what I want ..
>>
>>
>> -----Original Message-----
>> From: Alan Gates [mailto:gates@yahoo-inc.com]
>> Sent: Tuesday, August 28, 2007 10:45 AM
>> To: hadoop-user@lucene.apache.org
>> Cc: utkarsh@yahoo-inc.com
>> Subject: Re: looking for some help with pig syntax
>>
>> I think the following will do what you want.
>>
>> t1 = load table1 as id, listOfId;
>> t2 = load table2 as id, f1;
>> t1a = foreach t1 generate flatten(listOfId); -- flattens the listOfId into a set of ids
>> b = join t1a by $0, t2 by id; -- join the two together.
>> c = foreach b generate t2.id, t2.f1; -- project just the ids and f1 entries.
>>
>> Alan.
>>
>> Joydeep Sen Sarma wrote:
>>
>>> Specifically, how can we express this query:
>>>
>>>
>>>
>>> Table1 contains: id, (list of ids)
>>>
>>> Table2 contains: id, f1
>>>
>>>
>>>
>>> Where the Table1:list is a variable length list of foreign key (id) into
>>> Table2.
>>>
>>>
>>>
>>> We would like to join every element of Table1:list with the corresponding
>>> Table2:id. I.e., the final output should be of the form:
>>>
>>>
>>>
>>> Table3 contains: id, (list of f1)
>>>
>>>
>>>
>>> Couldn't quite figure out how to do this - does Pig Latin support nested
>>> foreach loops? If there's a more appropriate mailing list - please
>>> re-direct,
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> Joydeep
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>



