pig-dev mailing list archives

From "Will Lauer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-4608) FOREACH ... UPDATE
Date Thu, 18 Jan 2018 22:01:00 GMT

    [ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331309#comment-16331309 ]

Will Lauer commented on PIG-4608:
---------------------------------

Ok, just to close the loop, here are several examples given the new proposed syntax. I want
to make sure I understand which are correct and what the behavior is in each case.

```
/* simple projection, specifying resulting schema, using both explicit column names and positions */
a = FOREACH b GENERATE 1+s as x:long, $2+$3 as y:chararray, q-1 as z;
a = FOREACH b GENERATE FLATTEN(s) as (x:int, y:long, z:chararray); -- flattening tuples into individual columns
a = FOREACH b GENERATE FLATTEN(s) as x:int, 1 as y; -- flattening bags into multiple rows

/* complex projection, specifying resulting schema, using both explicit column names and positions */
a = FOREACH b {
    q = COUNT(s);
    r = someUdf($1,$2);
    GENERATE q as x:long, r as y;
}

/* simple update */
a = FOREACH b UPDATE q with r+s;

/* complex update */
a = FOREACH b {
    q = COUNT(s);
    r = someUdf($1, $2);
    UPDATE qprime WITH q, rprime WITH r;
}

/* simple update using positional arguments */
a = FOREACH b UPDATE $1 with r+$2;

/* simple renaming of a column */
a = FOREACH b UPDATE q as r;

/* simple schema type change */
a = FOREACH b UPDATE q WITH (int)q AS q:int; -- change q from something to int
a = FOREACH b UPDATE q AS q:int; -- This should be illegal, right? If the type is changed, an explicit modification of the value should occur

/* rename, type, and value change together */
a = FOREACH b UPDATE q WITH computeR(q) as r:long;

/* simple column drop */
a = FOREACH b DROP q,r,$5; -- drops columns q, r, and whatever column is at position $5 (the sixth column, since positions are zero-based)
a = FOREACH b DROP q:int; -- This should be illegal, right? No types should be present in a DROP statement
 
/* updating an individual field within a tuple - not implemented in the initial version */
a = FOREACH b UPDATE q.$1.fieldN WITH r+s; 

/* renaming an individual field within a tuple - not implemented in the initial version */
a = FOREACH b UPDATE q.$1.fieldN AS newFieldN; -- has the result of renaming the field within q.$1, not renaming q or $1

/* flattening a tuple into existing fields - does this make sense? */
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5);
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5) AS (q, r, t); -- renaming one column during flattening assignment
a = FOREACH b UPDATE (q,r,s) WITH FLATTEN($5) AS (q:int, r:chararray, s:long); -- re-typing arguments as part of flattening

/* flattening a bag into existing fields, exploding rows in the process -- does this make sense? */
a = FOREACH b UPDATE f1 WITH FLATTEN(bagCol);
a = FOREACH b UPDATE f1 WITH FLATTEN(bagCol) as f2:int; -- rename field and possibly retype as part of the flatten
```
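
To make the DROP case concrete, here is a minimal Python sketch of one possible reading: names and `$n` positions are all resolved against the input schema before anything is removed, and `$n` is zero-based as elsewhere in Pig. The function `drop_columns` and the list-based schema are hypothetical and exist only for illustration; nothing here reflects actual Pig code.

```python
def drop_columns(schema, row, targets):
    """Model FOREACH ... DROP on a single row.

    `schema` is a list of column names, `row` is a tuple of values, and each
    target is either a column name ("q") or a positional reference ("$5").
    Every target is resolved against the *input* schema first, so dropping
    one column never shifts the position that another target refers to.
    """
    to_drop = set()
    for target in targets:
        if target.startswith("$"):
            idx = int(target[1:])          # Pig positions are zero-based
            if idx >= len(schema):
                raise IndexError(f"no column at position {target}")
        else:
            idx = schema.index(target)     # raises ValueError for unknown names
        to_drop.add(idx)

    keep = [i for i in range(len(schema)) if i not in to_drop]
    return [schema[i] for i in keep], tuple(row[i] for i in keep)


# Hypothetical seven-column relation; mirrors "DROP q,r,$5" from the examples.
schema = ["p", "q", "r", "s", "t", "u", "v"]
row = (0, 1, 2, 3, 4, 5, 6)
new_schema, new_row = drop_columns(schema, row, ["q", "r", "$5"])
print(new_schema, new_row)
```

Under this reading, `$5` removes `u` (the sixth column) regardless of the fact that `q` and `r` are dropped in the same statement.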

While I admit the WITH/AS syntax is useful, it still feels a bit weird to me as a Pig script
writer. I'd love to have [~kpriceyahoo] weigh in on the proposal to ensure it still makes
sense to heavy Pig script writers.

> FOREACH ... UPDATE
> ------------------
>
>                 Key: PIG-4608
>                 URL: https://issues.apache.org/jira/browse/PIG-4608
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Haley Thrapp
>            Priority: Major
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large number of fields (in the 20-200 range). Often, we need to make modifications to only a few fields. The FOREACH ... UPDATE statement allows the developer to focus on the actual logical changes instead of having to list all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe this can be done with changes to the parser and the creation of a new LOUpdate. No physical plan changes should be needed because we will leverage what LOGenerate does.
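
The replace-or-append rule in the quoted description can be modelled in a few lines of Python. This is only a sketch of the semantics as I read them, not the prototype mentioned above: `foreach_update` and its argument shapes are hypothetical. It assumes every expression is evaluated against the original input row, which is what the sample output implies (`new_sum` is 3 for the first row, i.e. 1+2, not 5+2).

```python
def foreach_update(schema, rows, updates):
    """Model the proposed FOREACH ... UPDATE.

    `updates` is a list of (alias, expression) pairs, where each expression
    is a callable that receives the input row as a dict keyed by field name.
    An alias that matches an existing field replaces it in place; any other
    alias is appended to the end of the tuple.
    """
    out_schema = list(schema)
    for alias, _ in updates:
        if alias not in out_schema:
            out_schema.append(alias)

    out_rows = []
    for row in rows:
        original = dict(zip(schema, row))   # expressions only see input values
        result = dict(original)
        for alias, expr in updates:
            result[alias] = expr(original)
        out_rows.append(tuple(result[name] for name in out_schema))
    return out_schema, out_rows


# The example from the issue description: "5 as f1, f1+f2 as new_sum".
schema = ["f1", "f2", "f3"]
rows = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
updates = [
    ("f1", lambda r: 5),
    ("new_sum", lambda r: r["f1"] + r["f2"]),
]
out_schema, out_rows = foreach_update(schema, rows, updates)
for r in out_rows:
    print(r)
```

With those inputs the sketch reproduces the three tuples shown in the description, with `new_sum` appended as a fourth field.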



