hadoop-pig-dev mailing list archives

From "Viraj Bhat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1252) Diamond splitter does not generate correct results when using Multi-query optimization
Date Tue, 02 Mar 2010 21:43:27 GMT

    [ https://issues.apache.org/jira/browse/PIG-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840339#action_12840339 ]

Viraj Bhat commented on PIG-1252:
---------------------------------

A modified version of the script, with the nested FOREACH removed, works. Is the problem related to the nested FOREACH?

{code}
loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7');

prjData = FOREACH loadData GENERATE
    (chararray) col1,
    (chararray) col2,
    (chararray) col3,
    (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond,
    (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;

SPLIT prjData INTO
    trueDataTmp IF (validRec == '1' AND splitcond != ''),
    falseDataTmp IF (validRec == '1' AND splitcond == '');

grpData = GROUP trueDataTmp BY splitcond;

-- No nested block: the grouped bag is passed to the UDF directly, without the ORDER step.
finalData = FOREACH grpData GENERATE FLATTEN ( MYUDF (trueDataTmp, 60, 1800, 'input.txt', 'input.dat', '20100222', '5', 'debug_on')) as (s,m,l);
                             
dump finalData;
{code}

> Diamond splitter does not generate correct results when using Multi-query optimization
> --------------------------------------------------------------------------------------
>
>                 Key: PIG-1252
>                 URL: https://issues.apache.org/jira/browse/PIG-1252
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>
> I have a script that uses SPLIT but somehow does not use one of the split branches. The skeleton of the script is as follows:
> {code}
> loadData = load '/user/viraj/zebradata' using org.apache.hadoop.zebra.pig.TableLoader('col1,col2, col3, col4, col5, col6, col7');
> prjData = FOREACH loadData GENERATE
>     (chararray) col1,
>     (chararray) col2,
>     (chararray) col3,
>     (chararray) ((col4 is not null and col4 != '') ? col4 : ((col5 is not null) ? col5 : '')) as splitcond,
>     (chararray) (col6 == 'c' ? 1 : IS_VALID ('200', '0', '0', 'input.txt')) as validRec;
> SPLIT prjData INTO
>     trueDataTmp IF (validRec == '1' AND splitcond != ''),
>     falseDataTmp IF (validRec == '1' AND splitcond == '');
> grpData = GROUP trueDataTmp BY splitcond;
> finalData = FOREACH grpData {
>     orderedData = ORDER trueDataTmp BY col1,col2;
>     GENERATE FLATTEN ( MYUDF (orderedData, 60, 1800, 'input.txt', 'input.dat','20100222','5', 'debug_on')) as (s,m,l);
> }
> dump finalData;
> {code}
> You can see that "falseDataTmp" is untouched.
> When I run this script with multi-query optimization disabled (the -M option), I get the right result. This could be caused by the complex BinConds in the POLoad. The error can be avoided by using FILTER instead of SPLIT.
> Viraj
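
The FILTER workaround mentioned at the end of the quoted description might look like the sketch below. It assumes the same aliases and conditions as the quoted script and only replaces the SPLIT statement; it has not been verified against Pig 0.6.0.

{code}
-- Sketch only: two independent FILTER statements in place of the SPLIT.
-- Aliases and conditions are copied from the quoted script.
trueDataTmp = FILTER prjData BY (validRec == '1' AND splitcond != '');
falseDataTmp = FILTER prjData BY (validRec == '1' AND splitcond == '');
{code}

The rest of the script (GROUP, nested FOREACH, dump) would stay unchanged, since it refers only to trueDataTmp.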

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

