spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-21031) Clearly separate hive stats and spark stats in catalog
Date Fri, 09 Jun 2017 06:24:18 GMT

     [ https://issues.apache.org/jira/browse/SPARK-21031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21031:
------------------------------------

    Assignee:     (was: Apache Spark)

> Clearly separate hive stats and spark stats in catalog
> ------------------------------------------------------
>
>                 Key: SPARK-21031
>                 URL: https://issues.apache.org/jira/browse/SPARK-21031
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Zhenhua Wang
>
> Currently, Hive's stats are read into `CatalogStatistics`, and Spark's stats are also
> persisted through `CatalogStatistics`. Therefore, given a `CatalogStatistics`, we cannot tell
> whether its stats came from Hive or from Spark. As a result, Hive's stats can be unexpectedly
> propagated into Spark's stats.
> For example, for a catalog table, we read stats from Hive, e.g. "totalSize", and put them
> into `CatalogStatistics`. Then an "ALTER TABLE" command stores the stats in
> `CatalogStatistics` into the metastore as Spark's stats (because we don't know whether they
> came from Spark or not). But Spark's stats should only be generated by the "ANALYZE" command,
> so this is an unexpected side effect of "ALTER TABLE".
> Secondly, once the wrong Spark stats are stored, after inserting new data, although Hive
> updates "totalSize" in the metastore, we still cannot get the right `sizeInBytes` in
> `CatalogStatistics`, because we respect Spark's stats (the wrong stats) over Hive's stats.
> {code}
> spark-sql> create table xx(i string, j string);
> spark-sql> insert into table xx select 'a', 'b';
> spark-sql> desc formatted xx;
> # col_name	data_type	comment
> i	string	NULL
> j	string	NULL
> # Detailed Table Information		
> Database	default	
> Table	xx	
> Owner	wzh	
> Created	Thu Jun 08 18:30:46 PDT 2017	
> Last Access	Wed Dec 31 16:00:00 PST 1969	
> Type	MANAGED	
> Provider	hive	
> Properties	[serialization.format=1]	
> Statistics	4 bytes	
> Location	file:/Users/wzh/Projects/spark/spark-warehouse/xx	
> Serde Library	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	
> InputFormat	org.apache.hadoop.mapred.TextInputFormat	
> OutputFormat	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	
> Partition Provider	Catalog	
> Time taken: 0.089 seconds, Fetched 19 row(s)
> spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
> Time taken: 0.187 seconds
> spark-sql> insert into table xx select 'c', 'd';
> Time taken: 0.583 seconds
> spark-sql> desc formatted xx;
> # col_name	data_type	comment
> i	string	NULL
> j	string	NULL
> # Detailed Table Information		
> Database	default	
> Table	xx	
> Owner	wzh	
> Created	Thu Jun 08 18:30:46 PDT 2017	
> Last Access	Wed Dec 31 16:00:00 PST 1969	
> Type	MANAGED	
> Provider	hive	
> Properties	[serialization.format=1]	
> Statistics	4 bytes	(-- This should be 8 bytes)
> Location	file:/Users/wzh/Projects/spark/spark-warehouse/xx	
> Serde Library	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	
> InputFormat	org.apache.hadoop.mapred.TextInputFormat	
> OutputFormat	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	
> Partition Provider	Catalog	
> Time taken: 0.077 seconds, Fetched 19 row(s)
> {code}
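
The failure mode described above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not Spark's actual implementation: the `Metastore` class, the helper functions, and the `spark.stats.totalSize` key are hypothetical names chosen for illustration, and the byte counts simply mirror the 4-byte rows in the session above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

HIVE_TOTAL_SIZE = "totalSize"               # Hive-maintained property named in the issue
SPARK_TOTAL_SIZE = "spark.stats.totalSize"  # hypothetical key standing in for Spark's stats


@dataclass
class Metastore:
    """Toy stand-in for the metastore: the property map of a single table."""
    props: Dict[str, int] = field(default_factory=dict)

    def hive_insert(self, nbytes: int) -> None:
        # Hive keeps "totalSize" up to date whenever data is written.
        self.props[HIVE_TOTAL_SIZE] = self.props.get(HIVE_TOTAL_SIZE, 0) + nbytes


def read_size_in_bytes(ms: Metastore) -> Optional[int]:
    # Read path: Spark-written stats are respected over Hive's stats.
    if SPARK_TOTAL_SIZE in ms.props:
        return ms.props[SPARK_TOTAL_SIZE]
    return ms.props.get(HIVE_TOTAL_SIZE)


def alter_table(ms: Metastore) -> None:
    # Write path: whatever the catalog statistics currently hold is persisted
    # as Spark's stats, because their origin (Hive or Spark) is unknown.
    size = read_size_in_bytes(ms)
    if size is not None:
        ms.props[SPARK_TOTAL_SIZE] = size


def run_scenario():
    ms = Metastore()
    ms.hive_insert(4)                # first INSERT
    before = read_size_in_bytes(ms)  # falls back to Hive's totalSize
    alter_table(ms)                  # ALTER TABLE ... SET TBLPROPERTIES
    ms.hive_insert(4)                # second INSERT; Hive updates totalSize
    after = read_size_in_bytes(ms)   # the stale Spark-side copy wins
    return before, after, ms.props[HIVE_TOTAL_SIZE]


if __name__ == "__main__":
    print(run_scenario())
```

In this model the size reported after the second insert stays at the value copied during `alter_table`, even though Hive's own `totalSize` has grown. Recording where each statistic came from, as the issue title proposes, would let the write path skip values that Spark did not compute itself.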



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

