Date: Fri, 9 Jun 2017 06:14:18 +0000 (UTC)
From: "Zhenhua Wang (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Created] (SPARK-21031) Clearly separate hive stats and spark stats in catalog

Zhenhua Wang created SPARK-21031:
------------------------------------

             Summary: Clearly separate hive stats and spark stats in catalog
                 Key: SPARK-21031
                 URL: https://issues.apache.org/jira/browse/SPARK-21031
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Zhenhua Wang


Currently, Hive's stats are read into `CatalogStatistics`, and Spark's stats are also persisted through `CatalogStatistics`. Therefore, given a `CatalogStatistics`, we cannot tell whether its stats came from Hive or from Spark. As a result, Hive's stats can be unexpectedly propagated into Spark's stats.

For example, when an "ALTER TABLE" command runs, we store the stats info held in `CatalogStatistics` (read from Hive, e.g. "totalSize") into the metastore as Spark's stats, because we don't know whether it came from Spark or not. But Spark's stats should only be generated by the "ANALYZE" command, so this is an unexpected side effect of "ALTER TABLE".

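The propagation described above, and the stale size it leads to after a later insert, can be modelled with a small toy script. This is only a sketch and not Spark's actual implementation: the table properties are a plain dict, `SPARK_TOTAL_SIZE` is a hypothetical property key, and `Stats`, `read_stats`, `alter_table`, `hive_insert` and `replay` are illustrative names. The only name taken from this report is Hive's "totalSize". The `separate_sources` flag stands in for the proposed fix of tracking where the stats came from.

```python
from dataclasses import dataclass
from typing import Dict, Optional

HIVE_TOTAL_SIZE = "totalSize"               # Hive-maintained property named in the report
SPARK_TOTAL_SIZE = "spark.stats.totalSize"  # hypothetical key for Spark-owned stats


@dataclass(frozen=True)
class Stats:
    size_in_bytes: int
    source: str  # "hive" or "spark"


def read_stats(props: Dict[str, str]) -> Optional[Stats]:
    """Read stats from table properties, preferring Spark's stats over Hive's."""
    if SPARK_TOTAL_SIZE in props:
        return Stats(int(props[SPARK_TOTAL_SIZE]), "spark")
    if HIVE_TOTAL_SIZE in props:
        return Stats(int(props[HIVE_TOTAL_SIZE]), "hive")
    return None


def alter_table(props: Dict[str, str], new_props: Dict[str, str],
                stats: Optional[Stats], separate_sources: bool) -> Dict[str, str]:
    """Model of ALTER TABLE ... SET TBLPROPERTIES writing the table back.

    With separate_sources=False the in-memory stats are always persisted as
    Spark stats (the behaviour reported here). With separate_sources=True only
    stats that really came from Spark are persisted.
    """
    out = dict(props)
    out.update(new_props)
    if stats is not None and (not separate_sources or stats.source == "spark"):
        out[SPARK_TOTAL_SIZE] = str(stats.size_in_bytes)
    return out


def hive_insert(props: Dict[str, str], added_bytes: int) -> Dict[str, str]:
    """Model of an insert: Hive bumps its own totalSize and nothing else."""
    out = dict(props)
    out[HIVE_TOTAL_SIZE] = str(int(out.get(HIVE_TOTAL_SIZE, "0")) + added_bytes)
    return out


def replay(separate_sources: bool) -> int:
    """Replay the insert / alter table / insert sequence and return the size read."""
    props = hive_insert({}, 4)                    # first insert: 4 bytes
    props = alter_table(props, {"prop1": "yy"},
                        read_stats(props), separate_sources)
    props = hive_insert(props, 4)                 # second insert: 4 more bytes
    stats = read_stats(props)
    assert stats is not None
    return stats.size_in_bytes


if __name__ == "__main__":
    print("without separation:", replay(False))
    print("with separation:   ", replay(True))
```

Under these assumptions, `replay(False)` keeps reporting the size captured at "ALTER TABLE" time, while `replay(True)` follows Hive's updated "totalSize", because the Hive-sourced value is never written back as Spark's stats.
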
Besides, now that we have stored wrong Spark stats, after inserting new data we still cannot get the right `sizeInBytes` in `CatalogStatistics`, even though Hive has updated "totalSize" in the metastore, because we prefer the wrong Spark stats over Hive's stats.

{code}
spark-sql> create table xx(i string, j string);
spark-sql> insert into table xx select 'a', 'b';
spark-sql> desc formatted xx;
# col_name                    data_type    comment
i                             string       NULL
j                             string       NULL

# Detailed Table Information
Database                      default
Table                         xx
Owner                         wzh
Created                       Thu Jun 08 18:30:46 PDT 2017
Last Access                   Wed Dec 31 16:00:00 PST 1969
Type                          MANAGED
Provider                      hive
Properties                    [serialization.format=1]
Statistics                    4 bytes
Location                      file:/Users/wzh/Projects/spark/spark-warehouse/xx
Serde Library                 org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat                   org.apache.hadoop.mapred.TextInputFormat
OutputFormat                  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider            Catalog
Time taken: 0.089 seconds, Fetched 19 row(s)

spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
Time taken: 0.187 seconds
spark-sql> insert into table xx select 'c', 'd';
Time taken: 0.583 seconds
spark-sql> desc formatted xx;
# col_name                    data_type    comment
i                             string       NULL
j                             string       NULL

# Detailed Table Information
Database                      default
Table                         xx
Owner                         wzh
Created                       Thu Jun 08 18:30:46 PDT 2017
Last Access                   Wed Dec 31 16:00:00 PST 1969
Type                          MANAGED
Provider                      hive
Properties                    [serialization.format=1]
Statistics                    4 bytes (-- This should be 8 bytes)
Location                      file:/Users/wzh/Projects/spark/spark-warehouse/xx
Serde Library                 org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat                   org.apache.hadoop.mapred.TextInputFormat
OutputFormat                  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider            Catalog
Time taken: 0.077 seconds, Fetched 19 row(s)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org