spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "JESSE CHEN (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set
Date Fri, 18 Mar 2016 02:25:33 GMT

    [ https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200850#comment-15200850
] 

JESSE CHEN edited comment on SPARK-13865 at 3/18/16 2:24 AM:
-------------------------------------------------------------

Hive, big sql, db2 queries are all generated off corresponding query templates. Hive apparently
generated the one I listed in the initial report (with JOINs). So this is what I am asking
TPC to see why the variant exists in the templates.

Meanwhile, I tested the query you found, and as expected, Spark SQL isn't able to parse it
with the following errors:
Query:
{noformat}
select count(*)
from ((select distinct c_last_name, c_first_name, d_date
       from store_sales, date_dim, customer
       where store_sales.ss_sold_date_sk = date_dim.d_date_sk
         and store_sales.ss_customer_sk = customer.c_customer_sk
         and d_month_seq between 1200 and 1200+11)
       except
      (select distinct c_last_name, c_first_name, d_date
       from catalog_sales, date_dim, customer
       where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
         and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
         and d_month_seq between 1200 and 1200+11)
       except
      (select distinct c_last_name, c_first_name, d_date
       from web_sales, date_dim, customer
       where web_sales.ws_sold_date_sk = date_dim.d_date_sk
         and web_sales.ws_bill_customer_sk = customer.c_customer_sk
         and d_month_seq between 1200 and 1200+11)
) cool_cust
;

{noformat}
Error:
{noformat}
16/03/17 19:17:57 INFO parse.ParseDriver: Parsing command: explain select count(*)  from ((select
distinct c_last_name, c_first_name, d_date        from store_sales, date_dim, customer   
    where store_sales.ss_sold_date_sk = date_dim.d_date_sk          and store_sales.ss_customer_sk
= customer.c_customer_sk          and d_month_seq between 1200 and 1200+11)        except
      (select distinct c_last_name, c_first_name, d_date        from catalog_sales, date_dim,
customer        where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk          and catalog_sales.cs_bill_customer_sk
= customer.c_customer_sk          and d_month_seq between 1200 and 1200+11)        except
      (select distinct c_last_name, c_first_name, d_date        from web_sales, date_dim,
customer        where web_sales.ws_sold_date_sk = date_dim.d_date_sk          and web_sales.ws_bill_customer_sk
= customer.c_customer_sk          and d_month_seq between 1200 and 1200+11) ) cool_cust
NoViableAltException(296@[150:5: ( ( Identifier LPAREN )=> partitionedTableFunction | tableSource
| subQuerySource | virtualTableSource )])
	at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
	at org.antlr.runtime.DFA.predict(DFA.java:144)
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3711)
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1873)
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1518)
{noformat}




was (Author: jfchen@us.ibm.com):
Hive, big sql, db2 queries are all generated off corresponding query templates. Hive apparently
generated the one I listed in the initial report (with JOINs). So this is what I am asking
TPC to see why the variant exists in the templates.

Meanwhile, I tested the query, and as expected, Spark SQL isn't able to parse it with the
following errors:

{noformat}
16/03/17 19:17:57 INFO parse.ParseDriver: Parsing command: explain select count(*)  from ((select
distinct c_last_name, c_first_name, d_date        from store_sales, date_dim, customer   
    where store_sales.ss_sold_date_sk = date_dim.d_date_sk          and store_sales.ss_customer_sk
= customer.c_customer_sk          and d_month_seq between 1200 and 1200+11)        except
      (select distinct c_last_name, c_first_name, d_date        from catalog_sales, date_dim,
customer        where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk          and catalog_sales.cs_bill_customer_sk
= customer.c_customer_sk          and d_month_seq between 1200 and 1200+11)        except
      (select distinct c_last_name, c_first_name, d_date        from web_sales, date_dim,
customer        where web_sales.ws_sold_date_sk = date_dim.d_date_sk          and web_sales.ws_bill_customer_sk
= customer.c_customer_sk          and d_month_seq between 1200 and 1200+11) ) cool_cust
NoViableAltException(296@[150:5: ( ( Identifier LPAREN )=> partitionedTableFunction | tableSource
| subQuerySource | virtualTableSource )])
	at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
	at org.antlr.runtime.DFA.predict(DFA.java:144)
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3711)
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1873)
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1518)
{noformat}



> TPCDS query 87 returns wrong results compared to TPC official result set 
> -------------------------------------------------------------------------
>
>                 Key: SPARK-13865
>                 URL: https://issues.apache.org/jira/browse/SPARK-13865
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: JESSE CHEN
>              Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared to official
result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 47555, answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +-------+
> |     1 |
> +-------+
> | 47298 |
> +-------+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed QUALIFICATION
> select count(*) 
> from 
>      (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as ddate1, 1
as notnull1
>        from store_sales
>         JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
>         JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>        where
>          d_month_seq between 1200 and 1200+11
>        ) tmp1
>        left outer join
>       (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as ddate2, 1
as notnull2
>        from catalog_sales
>         JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
>         JOIN customer ON catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
>        where 
>          d_month_seq between 1200 and 1200+11
>        ) tmp2 
>       on (tmp1.cln1 = tmp2.cln2)
>       and (tmp1.cfn1 = tmp2.cfn2)
>       and (tmp1.ddate1= tmp2.ddate2)
>        left outer join
>       (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as ddate3,
1 as notnull3
>        from web_sales
>         JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
>         JOIN customer ON web_sales.ws_bill_customer_sk = customer.c_customer_sk
>        where 
>          d_month_seq between 1200 and 1200+11
>        ) tmp3 
>       on (tmp1.cln1 = tmp3.cln3)
>       and (tmp1.cfn1 = tmp3.cfn3)
>       and (tmp1.ddate1= tmp3.ddate3)
> where  
> notnull2 is null and notnull3 is null  
> ;
> -- end query 87 in stream 0 using template query87.tpl
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message