pig-dev mailing list archives

From "Will Lauer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-4608) FOREACH ... UPDATE
Date Sat, 13 Jan 2018 05:59:00 GMT

    [ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324986#comment-16324986 ]

Will Lauer commented on PIG-4608:

I've gone ahead and made a patch to implement a version of this functionality as a starting
point for discussion. Once I figure out how to upload it to reviewboard, everyone can take
a look at it.

There are several requirements that we have here:
# Need to modify values of arbitrary fields 
#* without having to specify every field
#* without the field order changing unexpectedly
#* without having to know the current index to the field
# Need to remove fields
#* without having to know the index of the field
#* without reordering the rest of the fields
# Need ability to change the type of a field
# Ability to reference a field without specifying its disambiguating join prefix when field
is unambiguous
# Update must support the FOREACH nested block syntax 

Additionally, I agree with Rohini that a "strict" mode is required to prevent typos from causing
scripts to run with unexpected behavior (adding a new column instead of modifying
an existing one).

While nice to have, being able to specify adds, deletes, and updates all in the same statement
isn't a strict requirement, as that can be done simply with multiple successive FOREACH statements.

The syntax that I've made work is
a = load 'input' using mock.Storage() as (x:chararray, y:chararray, z:long);
b = foreach a generate x+y as q, y, z:long;
c = foreach a update "prefix"+x as x, (chararray)(z+1) as z:chararray;
d = foreach a delete x, z;
e = foreach a {
           nextInt = z+1;
           update nextInt as z:int;
}

To me, the ... syntax seems weird, so I've gone with separate UPDATE and DELETE commands. For
clarity, only a single command can exist per statement (no foreach update a, delete b). Similarly,
there is no support for appending columns, as that is easily accomplished already with
b = foreach a generate *, a+5 as newCol:chararray;
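
For comparison, here is a rough sketch of how the update and delete statements above would have
to be written in current Pig, where every pass-through field must be listed by hand. It assumes
the same three-field schema. PigStorage and the use of CONCAT for the string prefix are my own
assumptions for illustration and are not part of the patch.

```pig
-- Assumed schema, matching the example above.
a = LOAD 'input' USING PigStorage() AS (x:chararray, y:chararray, z:long);

-- Roughly: c = foreach a update ... as x, ... as z:chararray;
-- y is untouched, but still has to be listed, and in the right position.
c = FOREACH a GENERATE CONCAT('prefix', x) AS x, y, (chararray)(z + 1) AS z;

-- Roughly: d = foreach a delete x, z;
-- Only the surviving field is listed.
d = FOREACH a GENERATE y;
```

With three fields this is manageable. With the 20-200 fields mentioned in the issue, the GENERATE
list is where the field-order and typo problems from the requirements above come from.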

> ------------------
>                 Key: PIG-4608
>                 URL: https://issues.apache.org/jira/browse/PIG-4608
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Haley Thrapp
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do not match
> an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large number of fields
> (in the 20-200 range). Often, we need to make modifications to only a few fields. The FOREACH
> ... UPDATE statement allows the developer to focus on the actual logical changes instead
> of having to list all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe this can
> be done with changes to the parser and the creation of a new LOUpdate. No physical plan changes
> should be needed because we will leverage what LOGenerate does.

This message was sent by Atlassian JIRA
