pig-dev mailing list archives

From "Jacob Tolar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-4608) FOREACH ... UPDATE
Date Sat, 20 Jun 2015 05:06:00 GMT

    [ https://issues.apache.org/jira/browse/PIG-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594401#comment-14594401 ]

Jacob Tolar commented on PIG-4608:
----------------------------------

Hi Rohini, are you suggesting this:

{code}
updated = FOREACH three_numbers GENERATE
   ...,
   5 as f1,
   ...,
   f1+f2 as new_sum;
{code}

?

Here's an exaggerated example of why we think something like foreach ... update would work
better. Original Pig script:

{code}
-- assume we are using the schema load option
-- ( http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html )
-- with fields named f1, f2, ..., f50
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i generate
  f1, 
  f2, 
  3 as f3, 
  f4, 
  f5, 
  6 as f6,  
  -- ... you get the idea, we're updating every 3rd field for some reason
  48 as f48,
  f49,
  f50;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}

Here it is with the project-range notation that already exists in Pig. In this particularly nasty
case we still mention every single field, even though we're using project-range:
{code}
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i generate
  f1..f2,
  3 as f3, 
  f4..f5,
  6 as f6, 
  -- etc
  48 as f48,
  f49..f50;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}

I think this is what you're suggesting. It's a little better than project-range but still
not great (lots of extra dots):
{code}
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i generate
  ...,
  3 as f3, 
  ...,
  6 as f6, 
  ...,
  9 as f9, 
  -- etc
  48 as f48,
  ...;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}

With foreach ... update, we only need to list the fields that are changing.

{code}
i = load '/path/to/data' USING PigStorage();
intermediate = foreach i update
  3 as f3, 
  6 as f6, 
  9 as f9, 
  -- etc
  48 as f48;
store intermediate into '/path/to/output' USING PigStorage(',');
{code}

The last one is much clearer (if 'foreach update' has clearly defined semantics) and is also
the shortest because it has the least extra syntactic overhead: you only need to type exactly
what you want, nothing more. That makes it easier to write, easier to read later, and (we
believe...but we can't use it yet :)) less prone to error.
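
To make "clearly defined semantics" concrete, here is a rough Python sketch of how the rules in the issue description quoted below could be read. It is not the prototype and not Pig code: the helper name foreach_update and the (alias, expression) list are made up for illustration. It assumes every expression is evaluated against the original tuple, which is inferred from the expected output in the description (new_sum is 1+2=3, not 5+2=7).

```python
def foreach_update(row, schema, updates):
    """Hypothetical model of FOREACH ... UPDATE applied to a single tuple.

    row:     tuple of field values
    schema:  list of field aliases, in tuple order
    updates: list of (alias, expr) pairs; expr takes a dict of the ORIGINAL values
    """
    original = dict(zip(schema, row))
    out_values = dict(original)
    out_schema = list(schema)
    for alias, expr in updates:
        # Assumption: expressions see the original values, not earlier updates.
        out_values[alias] = expr(original)
        # Aliases that match no existing field are appended to the end.
        if alias not in out_schema:
            out_schema.append(alias)
    return tuple(out_values[a] for a in out_schema), out_schema


# The three_numbers example from the issue description.
schema = ["f1", "f2", "f3"]
rows = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
updates = [
    ("f1", lambda r: 5),                      # 5 as f1
    ("new_sum", lambda r: r["f1"] + r["f2"]), # f1+f2 as new_sum
]
updated = [foreach_update(row, schema, updates)[0] for row in rows]
```

Under these assumptions, updated holds the same three tuples as the Dump in the description. Whether a later expression should instead see an earlier update is exactly the kind of detail the semantics would need to pin down.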

> FOREACH ... UPDATE
> ------------------
>
>                 Key: PIG-4608
>                 URL: https://issues.apache.org/jira/browse/PIG-4608
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Haley Thrapp
>
> I would like to propose a new command in Pig, FOREACH...UPDATE.
> Syntactically, it would look much like FOREACH … GENERATE.
> Example:
> Input data:
> (1,2,3)
> (2,3,4)
> (3,4,5)
> -- Load the data
> three_numbers = LOAD 'input_data'
> USING PigStorage()
> AS (f1:int, f2:int, f3:int);
> -- Sum up the row
> updated = FOREACH three_numbers UPDATE
> 5 as f1,
> f1+f2 as new_sum
> ;
> Dump updated;
> (5,2,3,3)
> (5,3,4,5)
> (5,4,5,7)
> Fields to update must be specified by alias. Any fields in the UPDATE that do not match an existing field will be appended to the end of the tuple.
> This command is particularly desirable in scripts that deal with a large number of fields (in the 20-200 range). Often, we only need to modify a few fields. The FOREACH ... UPDATE statement allows the developer to focus on the actual logical changes instead of having to list all of the fields that are also being passed through.
> My team has prototyped this with changes to FOREACH ... GENERATE. We believe this can be done with changes to the parser and the creation of a new LOUpdate. No physical plan changes should be needed because we will leverage what LOGenerate does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
