spark-issues mailing list archives

From "Rajkishore Hembram (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-22573) SQL Planner is including unnecessary columns in the projection
Date Tue, 21 Nov 2017 11:30:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-22573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajkishore Hembram updated SPARK-22573:
---------------------------------------
    Description: 
While running TPC-H query 18 for benchmarking, I observed that the query plan produced by
Apache Spark 2.2.0 is less efficient than the plans produced by other versions of Apache Spark.
The other versions (2.0.2 and 2.1.2) include only the required columns in the projections,
but the query planner of Apache Spark 2.2.0 includes unnecessary columns in the projection
for some queries, which unnecessarily increases I/O. As a result, Apache Spark 2.2.0 takes
more time to run the query.

[https://drive.google.com/file/d/1_u8nPKG_SIM7P6fs0VK-8UEXIhWPY_BN/view]
[https://drive.google.com/file/d/1xtxG5Ext36djfTDSdf_W5vGbbdgRApPo/view]

TPC-H Query 18
{code:sql}
select C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE, sum(L_QUANTITY)
from CUSTOMER, ORDERS, LINEITEM
where O_ORDERKEY in (
    select L_ORDERKEY from LINEITEM group by L_ORDERKEY having sum(L_QUANTITY) > 300
  )
  and C_CUSTKEY = O_CUSTKEY
  and O_ORDERKEY = L_ORDERKEY
group by C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE
order by O_TOTALPRICE desc, O_ORDERDATE
{code}
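
To make "required columns" concrete, the following is a minimal sketch (not part of Spark) that derives, from the text of query 18, the columns each table must supply. It assumes the TPC-H naming convention in which the column prefix (C_, O_, L_) identifies the table. The helper names `required_columns` and `unnecessary_columns` are hypothetical, and the projected column list in the usage note is an invented example, not taken from the linked plans.

```python
import re
from collections import defaultdict

# TPC-H query 18, as quoted in this report.
QUERY_18 = """
select C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE, sum(L_QUANTITY)
from CUSTOMER, ORDERS, LINEITEM
where O_ORDERKEY in (
    select L_ORDERKEY from LINEITEM group by L_ORDERKEY having sum(L_QUANTITY) > 300
  )
  and C_CUSTKEY = O_CUSTKEY
  and O_ORDERKEY = L_ORDERKEY
group by C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE
order by O_TOTALPRICE desc, O_ORDERDATE
"""

# Assumption: TPC-H naming, where the column prefix identifies the owning table.
PREFIX_TO_TABLE = {"C": "CUSTOMER", "O": "ORDERS", "L": "LINEITEM"}


def required_columns(sql):
    """Return {table: sorted list of columns referenced anywhere in the query}."""
    required = defaultdict(set)
    for prefix, name in re.findall(r"\b([COL])_([A-Z]+)\b", sql):
        required[PREFIX_TO_TABLE[prefix]].add(f"{prefix}_{name}")
    return {table: sorted(cols) for table, cols in required.items()}


def unnecessary_columns(table, projected, sql=QUERY_18):
    """Return the projected columns of `table` that the query never references."""
    needed = set(required_columns(sql).get(table, []))
    return sorted(set(projected) - needed)
```

For example, `unnecessary_columns("ORDERS", ["O_ORDERKEY", "O_CUSTKEY", "O_COMMENT"])` returns `["O_COMMENT"]`, since query 18 never references that column. Comparing the columns listed in each scan or projection of a plan against `required_columns(QUERY_18)` is one way to spot the extra columns described above.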




> SQL Planner is including unnecessary columns in the projection
> --------------------------------------------------------------
>
>                 Key: SPARK-22573
>                 URL: https://issues.apache.org/jira/browse/SPARK-22573
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Rajkishore Hembram
>
> While running TPC-H query 18 for benchmarking, I observed that the query plan produced by
> Apache Spark 2.2.0 is less efficient than the plans produced by other versions of Apache Spark.
> The other versions (2.0.2 and 2.1.2) include only the required columns in the projections,
> but the query planner of Apache Spark 2.2.0 includes unnecessary columns in the projection
> for some queries, which unnecessarily increases I/O. As a result, Apache Spark 2.2.0 takes
> more time to run the query.
> [https://drive.google.com/file/d/1_u8nPKG_SIM7P6fs0VK-8UEXIhWPY_BN/view]
> [https://drive.google.com/file/d/1xtxG5Ext36djfTDSdf_W5vGbbdgRApPo/view]
> TPC-H Query 18
> {code:sql}
> select C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE, sum(L_QUANTITY)
> from CUSTOMER, ORDERS, LINEITEM
> where O_ORDERKEY in (
>     select L_ORDERKEY from LINEITEM group by L_ORDERKEY having sum(L_QUANTITY) > 300
>   )
>   and C_CUSTKEY = O_CUSTKEY
>   and O_ORDERKEY = L_ORDERKEY
> group by C_NAME, C_CUSTKEY, O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE
> order by O_TOTALPRICE desc, O_ORDERDATE
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

