spark-issues mailing list archives

From "Zhenhua Wang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-20881) Use Hive's stats in metastore when cbo is disabled
Date Thu, 25 May 2017 12:22:05 GMT

     [ https://issues.apache.org/jira/browse/SPARK-20881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhenhua Wang updated SPARK-20881:
---------------------------------
    Description: 
Currently, statistics in Spark are generated by the "analyze" command.

However, when a user updates the table and collects stats in Hive, "totalSize"/"numRows"
are updated in the metastore.

At that point, the table stats on the Spark side become stale.
If CBO is enabled, this is acceptable, because we assume the user will handle it by re-running
the command to update the stats.
If CBO is disabled, we should fall back to the original behavior and respect Hive's stats. But
in the current implementation, Spark's stats always override Hive's stats, regardless of whether
CBO is enabled or disabled.

The right thing to do is to use (not override) Hive's stats when CBO is disabled.
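
The proposed selection rule can be illustrated with a small, self-contained Python sketch.
This is not Spark's implementation. The function and type names are hypothetical, the
"spark.sql.statistics." prefix for Spark-side stats is an assumption made for this example,
and falling back to the other source when one is missing is also an assumption that the
description above does not state.

```python
from typing import Mapping, NamedTuple, Optional


class TableStats(NamedTuple):
    size_in_bytes: int
    row_count: Optional[int]


# Assumed prefix for stats written by Spark's "analyze" command (illustrative only).
SPARK_STATS_PREFIX = "spark.sql.statistics."


def read_stats(props: Mapping[str, str], size_key: str, rows_key: str) -> Optional[TableStats]:
    """Build TableStats from table properties, or return None if the size is absent."""
    size = props.get(size_key)
    if size is None:
        return None
    rows = props.get(rows_key)
    return TableStats(int(size), int(rows) if rows is not None else None)


def choose_table_stats(props: Mapping[str, str], cbo_enabled: bool) -> Optional[TableStats]:
    """Pick which stats to use, following the rule described in this issue."""
    spark_stats = read_stats(
        props, SPARK_STATS_PREFIX + "totalSize", SPARK_STATS_PREFIX + "numRows"
    )
    hive_stats = read_stats(props, "totalSize", "numRows")

    if cbo_enabled:
        # CBO on: Spark's stats take precedence; the user is expected to
        # re-run the analyze command when they become stale.
        return spark_stats if spark_stats is not None else hive_stats
    # CBO off: respect Hive's stats instead of overriding them.
    return hive_stats if hive_stats is not None else spark_stats
```

For example, with both sets of properties present, choose_table_stats(props, cbo_enabled=False)
returns the values read from "totalSize"/"numRows", while cbo_enabled=True returns the
Spark-side values.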

  was:
Spark's statistics are generated by the "analyze" command.

However, when a user updates the table and collects stats in Hive, "totalSize"/"numRows"
are updated in the metastore.

At that point, the table stats on the Spark side are stale even if we turn off CBO, because in
the current implementation Spark's stats always override Hive's stats, regardless of whether
CBO is enabled or disabled.

The right thing to do is to use Hive's stats in this case.


> Use Hive's stats in metastore when cbo is disabled
> --------------------------------------------------
>
>                 Key: SPARK-20881
>                 URL: https://issues.apache.org/jira/browse/SPARK-20881
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Zhenhua Wang
>
> Currently, statistics in Spark are generated by the "analyze" command.
> However, when a user updates the table and collects stats in Hive, "totalSize"/"numRows"
> are updated in the metastore.
> At that point, the table stats on the Spark side become stale.
> If CBO is enabled, this is acceptable, because we assume the user will handle it by
> re-running the command to update the stats.
> If CBO is disabled, we should fall back to the original behavior and respect Hive's stats.
> But in the current implementation, Spark's stats always override Hive's stats, regardless of
> whether CBO is enabled or disabled.
> The right thing to do is to use (not override) Hive's stats when CBO is disabled.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

