pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Aniket Mokashi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-1631) Support to 2 level nested foreach
Date Wed, 03 Aug 2011 08:51:27 GMT

     [ https://issues.apache.org/jira/browse/PIG-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Aniket Mokashi updated PIG-1631:
--------------------------------

    Attachment: MultiLevelNestedForeach1.patch

> Support to 2 level nested foreach
> ---------------------------------
>
>                 Key: PIG-1631
>                 URL: https://issues.apache.org/jira/browse/PIG-1631
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.7.0
>            Reporter: Viraj Bhat
>            Assignee: Aniket Mokashi
>              Labels: gsoc2011
>             Fix For: 0.10
>
>         Attachments: MultiLevelNestedForeach1.patch, NestedForeachPatch3.txt, PIG-1631_3.patch
>
>
> What I would like to do is generate certain metrics for every listing impression in the
context of a page like clicks on the page etc. So, I first group by to get clicks and impression
together. Now, I would want to iterate through the mini-table (one per serve-id) and compute
metrics. Since nested foreach within foreach is not supported I ended up writing a UDF that
took both the bags and computed the metric. It would have been elegant to keep the logic of
iterating over the records outside in the PIG script. 
> Here is some pseudocode of how I would have liked to write it:
> {code}
> -- Let us say in our page context there was click on rank 2 for which there were 3 ads

> A1 = LOAD '...' AS (page_id, rank); -- clicks. 
> A2 = Load '...' AS (page_id, rank); -- impressions
> B = COGROUP A1 by (page_id), A2 by (page_id); 
> -- Let us say B contains the following schema 
> -- (group, {(A1...)} {(A2...)})  
> -- Each record would be in B would be:
> -- page_id_1, {(page_id_1, 2)} {(page_id_1, 1) (page_id_1, 2) (page_id_1, 3))}
> C = FOREACH B GENERATE {
>                 D = FLATTEN(A1), FLATTEN(A2); -- This wont work in current pig as well.
Basically, I would like a mini-table which represents an entire serve. 
>                 FOREACH D GENERATE
>                         page_id_1,
>                         A2:rank,
>                         SOMEUDF(A1:rank, A2::rank);  -- This UDF returns a value (like
v1, v2, v3 depending on A1::rank and A2::rank)
> };
> # output
> # page_id, 1, v1
> # page_id,  2, v2
> # page_id, 3, v3
> DUMP C;
> {code}
> P.S: I understand that I could have alternatively, flattened the fields of B and then
done a GROUP on page_id and then iterated through the records calling 'SOMEUDF' appropriately
but that would be 2 map-reduce operations AFAIK. 
> This is a candidate project for Google summer of code 2011. More information about the
program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message