pig-dev mailing list archives

From "liyunzhang_intel (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-4345) e2e test "RubyUDFs_13" fails because of the different result of "group a all" in different engines like "spark", "mapreduce"
Date Thu, 27 Nov 2014 02:11:12 GMT

     [ https://issues.apache.org/jira/browse/PIG-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated PIG-4345:
----------------------------------
    Description: 
The RubyUDFs_13 e2e test script is defined at line 3818 of nightly.conf:
{code}
                    'num' => 13,
                    'java_params' => ['-Dpig.accumulative.batchsize=5'],
                    'pig' => q\
register ':SCRIPTHOMEPATH:/ruby/morerubyudfs.rb' using jruby as myfuncs;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
b = foreach (group a all) generate FLATTEN(myfuncs.AppendIndex(a));
store b into ':OUTPATH:';\,
                    'verify_pig_script' => q\
register :FUNCPATH:/testudf.jar;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
b = foreach (group a all) generate FLATTEN(org.apache.pig.test.udf.evalfunc.AppendIndex(a));
store b into ':OUTPATH:';\,
                    },
                ]
            },
{code}
RubyUDFs_13.pig tests the Ruby UDF "AppendIndex" in "morerubyudfs.rb". Its output is compared
with the output of the verification script, which uses the Java UDF
"org.apache.pig.test.udf.evalfunc.AppendIndex".

For example, if the test file "studenttab10k" contains:

tom thompson	42	0.53
nick johnson	34	0.47
priscilla falkner	55	1.16

the result in the Spark engine will be:
tom thompson	42	0.53   1
nick johnson	34	0.47   2
priscilla falkner	55	1.16  3


the result in the MapReduce engine, which the verification script uses, will be:
priscilla falkner	55	1.16  1
nick johnson	34	0.47  2
tom thompson	42	0.53  3

This difference between the Spark and MapReduce results causes the RubyUDFs_13 e2e test to
fail.
The root cause is that "group a all" returns the tuples of the bag in a different order in
each engine.
In the Spark engine, "group a all" yields:
all {(tom thompson	42	0.53),(nick johnson	34	0.47),(priscilla falkner	55	1.16)}
In the MapReduce engine, "group a all" yields:
all {(priscilla falkner	55	1.16),(nick johnson	34	0.47),(tom thompson	42	0.53)}
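
The effect of the bag order can be reproduced outside Pig with a small Python sketch. The
function append_index below is a hypothetical stand-in for the AppendIndex UDF, assumed only
from the outputs shown above to append a 1-based position to each tuple in iteration order.
It is not the UDF's actual source.

```python
# Hypothetical stand-in for AppendIndex (an assumption, not the real UDF):
# append a 1-based position to each tuple, in the order the bag is iterated.
def append_index(bag):
    return [row + (i,) for i, row in enumerate(bag, start=1)]

rows = [
    ("tom thompson", 42, 0.53),
    ("nick johnson", 34, 0.47),
    ("priscilla falkner", 55, 1.16),
]

spark_bag = list(rows)                # bag order reported for the Spark engine
mapreduce_bag = list(reversed(rows))  # bag order reported for the MapReduce engine

spark_out = append_index(spark_bag)
mapreduce_out = append_index(mapreduce_bag)

# Sorting both outputs does not reconcile them, because the index is part
# of each row and is attached to a different tuple in each engine.
outputs_match = sorted(spark_out) == sorted(mapreduce_out)
print(outputs_match)
```

Because the index becomes part of every output row, the two outputs stay different even if
both are sorted before comparison.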

With PIG-4345.patch applied, the RubyUDFs_13 e2e test passes:
{code}
{
                    'num' => 13,
                    'java_params' => ['-Dpig.accumulative.batchsize=5'],
                    'pig' => q\
register ':SCRIPTHOMEPATH:/ruby/morerubyudfs.rb' using jruby as myfuncs;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
a1 = filter a by name == 'nick johnson';
a2 = filter a1 by age == 34;
b =  foreach (group a2 all) generate FLATTEN(myfuncs.AppendIndex(a2));
store b into ':OUTPATH:';\,
                    'verify_pig_script' => q\
register :FUNCPATH:/testudf.jar;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
a1 = filter a by name == 'nick johnson';
a2 = filter a1 by age == 34;
b =  foreach (group a2 all) generate FLATTEN(org.apache.pig.test.udf.evalfunc.AppendIndex(a2));
store b into ':OUTPATH:';\,
                    },
                ]
            },
{code}

With PIG-4345.patch, the result in both the Spark and MapReduce engines is:

nick johnson	34	0.47  1
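
Why the two filters remove the order dependence can be sketched in Python as well. As before,
append_index is a hypothetical stand-in for the AppendIndex UDF, and the three sample rows
stand in for studenttab10k. The sketch assumes, as the result above indicates, that the
filters leave exactly one row.

```python
from itertools import permutations

# Hypothetical stand-in for AppendIndex (an assumption, not the real UDF).
def append_index(bag):
    return [row + (i,) for i, row in enumerate(bag, start=1)]

rows = [
    ("tom thompson", 42, 0.53),
    ("nick johnson", 34, 0.47),
    ("priscilla falkner", 55, 1.16),
]

def patched_pipeline(bag):
    # Mirrors the patched script: two filters, then index the remaining bag.
    a1 = [r for r in bag if r[0] == "nick johnson"]
    a2 = [r for r in a1 if r[1] == 34]
    return append_index(a2)

# Run the pipeline over every possible input order and collect distinct results.
results = {tuple(patched_pipeline(list(p))) for p in permutations(rows)}
print(results)
```

Every input order yields the same single indexed row, so the engines agree. If the filters
matched more than one row, the order dependence would return.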


  was:
The RubyUDFs_13 e2e test script is defined at line 3818 of nightly.conf:
{code}
                    'num' => 13,
                    'java_params' => ['-Dpig.accumulative.batchsize=5'],
                    'pig' => q\
register ':SCRIPTHOMEPATH:/ruby/morerubyudfs.rb' using jruby as myfuncs;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
b = foreach (group a all) generate FLATTEN(myfuncs.AppendIndex(a));
store b into ':OUTPATH:';\,
                    'verify_pig_script' => q\
register :FUNCPATH:/testudf.jar;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
b = foreach (group a all) generate FLATTEN(org.apache.pig.test.udf.evalfunc.AppendIndex(a));
store b into ':OUTPATH:';\,
                    },
                ]
            },
{code}
RubyUDFs_13.pig tests the Ruby UDF "AppendIndex" in "morerubyudfs.rb". Its output is compared
with the output of the verification script, which uses the Java UDF
"org.apache.pig.test.udf.evalfunc.AppendIndex".

For example, if the test file "studenttab10k" contains:

tom thompson	42	0.53
nick johnson	34	0.47
priscilla falkner	55	1.16

the result in the Spark engine will be:
tom thompson	42	0.53   1
nick johnson	34	0.47   2
priscilla falkner	55	1.16  3


the result in the MapReduce engine, which the verification script uses, will be:
priscilla falkner	55	1.16  1
nick johnson	34	0.47  2
tom thompson	42	0.53  3

This difference between the Spark and MapReduce results causes the RubyUDFs_13 e2e test to
fail.
The root cause is that "group a all" returns the tuples of the bag in a different order in
each engine.
In the Spark engine, "group a all" yields:
all {(tom thompson	42	0.53),(nick johnson	34	0.47),(priscilla falkner	55	1.16)}
In the MapReduce engine, "group a all" yields:
all {(priscilla falkner	55	1.16),(nick johnson	34	0.47),(tom thompson	42	0.53)}

If the test script is modified as follows, the RubyUDFs_13 e2e test passes:
{code}
{
                    'num' => 13,
                    'java_params' => ['-Dpig.accumulative.batchsize=5'],
                    'pig' => q\
register ':SCRIPTHOMEPATH:/ruby/morerubyudfs.rb' using jruby as myfuncs;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
a1 = filter a by name == 'nick johnson';
a2 = filter a1 by age == 34;
b =  foreach (group a2 all) generate FLATTEN(myfuncs.AppendIndex(a2));
store b into ':OUTPATH:';\,
                    'verify_pig_script' => q\
register :FUNCPATH:/testudf.jar;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
a1 = filter a by name == 'nick johnson';
a2 = filter a1 by age == 34;
b =  foreach (group a2 all) generate FLATTEN(org.apache.pig.test.udf.evalfunc.AppendIndex(a2));
store b into ':OUTPATH:';\,
                    },
                ]
            },
{code}

With the modified test script, the result in both the Spark and MapReduce engines is:

nick johnson	34	0.47  1



> e2e test "RubyUDFs_13" fails because of the different result of "group a all" in different engines like "spark", "mapreduce"
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-4345
>                 URL: https://issues.apache.org/jira/browse/PIG-4345
>             Project: Pig
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: PIG-4345.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
