hive-dev mailing list archives

From "Chaoyu Tang" <ctang...@gmail.com>
Subject Re: Review Request 38429: HIVE-11786: Deprecate the use of redundant column in colunm stats related tables
Date Thu, 17 Sep 2015 17:33:16 GMT


> On Sept. 16, 2015, 9:26 p.m., Sergey Shelukhin wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java, line 317
> > <https://reviews.apache.org/r/38429/diff/1/?file=1075548#file1075548line317>
> >
> >     that may have perf impact... why not change the queries on metastore instead

The ColStatistics in hiveColStats returned from StatsUtils.getTableColumnStats(hiveTblMetadata,
hiveNonPartitionCols, nonPartColNamesThatRqrStats) (line 300) might not be in the same order
as the column names in nonPartColNamesThatRqrStats. The names in nonPartColNamesThatRqrStats
usually follow the order of the columns in the table DDL, which is neither ascending nor
descending, so we cannot rely on the SQL or JDO query in HMS to do this ordering.

We could enhance APIs like "public List<ColumnStatisticsObj> getTableColumnStatistics(String
dbName, String tableName, List<String> colNames)" so that the returned List<ColumnStatisticsObj>
follows the order of the passed-in List<String> colNames, by post-ordering the SQL/JDO results.
But when some columns have no stats returned, the caller still cannot match the returned stats
to the requested columns by position.

For simplicity, I left the ordering burden to whichever API client cares about the order, as
RelOptHiveTable.updateColStats(..) does. Without this re-ordering, the code at line 366
(hiveColStatsMap.put(nonPartColIndxsThatRqrStats.get(i), hiveColStats.get(i));) is not right.
Does that make sense?
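
To illustrate the client-side matching described above, here is a minimal sketch. It is not
the code in the patch. ColStat is a hypothetical stand-in for ColStatistics, and
ColStatsReorder and indexByRequestedColumn are names invented for this example. The idea is
to pair stats with requested columns by name instead of by list position, and to skip
columns for which no stats came back:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColStatsReorder {

    /** Hypothetical stand-in for ColStatistics: a column name plus one statistic. */
    public static final class ColStat {
        public final String columnName;
        public final long distinctValues;

        public ColStat(String columnName, long distinctValues) {
            this.columnName = columnName;
            this.distinctValues = distinctValues;
        }
    }

    /**
     * Maps each requested column index to its statistics by matching on column name,
     * not on list position. Requested columns with no returned stats are left out.
     */
    public static Map<Integer, ColStat> indexByRequestedColumn(
            List<String> requestedNames,
            List<Integer> requestedIndexes,
            List<ColStat> returnedStats) {
        // Index the returned stats by column name so their order no longer matters.
        Map<String, ColStat> statsByName = new HashMap<>();
        for (ColStat stat : returnedStats) {
            statsByName.put(stat.columnName, stat);
        }

        Map<Integer, ColStat> statsByIndex = new HashMap<>();
        for (int i = 0; i < requestedNames.size(); i++) {
            ColStat stat = statsByName.get(requestedNames.get(i));
            if (stat != null) {
                // Only columns that actually have stats get an entry.
                statsByIndex.put(requestedIndexes.get(i), stat);
            }
        }
        return statsByIndex;
    }

    public static void main(String[] args) {
        // Requested in DDL order (c, a, b); stats come back in another order, with b missing.
        List<String> names = Arrays.asList("c", "a", "b");
        List<Integer> indexes = Arrays.asList(2, 0, 1);
        List<ColStat> returned = Arrays.asList(new ColStat("a", 10L), new ColStat("c", 30L));

        Map<Integer, ColStat> result = indexByRequestedColumn(names, indexes, returned);
        for (Map.Entry<Integer, ColStat> e : result.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().columnName);
        }
    }
}
```

With a positional put like the one at line 366, the stats for "a" in this example would be
stored under the index of "c". Matching by name avoids that and also tolerates the missing
stats for "b".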


> On Sept. 16, 2015, 9:26 p.m., Sergey Shelukhin wrote:
> > metastore/src/java/org/apache/hadoop/hive/metastore/StatObjectConverter.java, line 61
> > <https://reviews.apache.org/r/38429/diff/1/?file=1075546#file1075546line61>
> >
> >     what if some code uses this? seems like not setting it is better, the bugs would
> >     be more obvious.
> >     Or for now you could set it from table/db fields but deprecate the getter...
> >     as is it's not proper deprecation cause it may break things.

Yes, good point. I removed the getter methods for the db/table/partition names in
MTableColumnStatistics/MPartitionColumnStatistics. I still set these columns to the value
"Deprecated" in case someone peeks at these tables in the backend DB and is confused by
possibly mismatched names. I cannot set them to null (or leave them unset) because their
table DDL does not allow them to be nullable.


- Chaoyu


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38429/#review99307
-----------------------------------------------------------


On Sept. 16, 2015, 12:37 p.m., Chaoyu Tang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38429/
> -----------------------------------------------------------
> 
> (Updated Sept. 16, 2015, 12:37 p.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan, Sergey Shelukhin, and Xuefu Zhang.
> 
> 
> Bugs: HIVE-11786
>     https://issues.apache.org/jira/browse/HIVE-11786
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The stats tables such as TAB_COL_STATS and PART_COL_STATS have redundant columns such as
> DB_NAME, TABLE_NAME, and PARTITION_NAME, since these tables already have foreign keys like
> TBL_ID or PART_ID referencing TBLS or PARTITIONS. But these columns are currently used in
> fetching column stats (e.g. getTableStats/getPartitionStats), so any Hive operation that
> changes a db/table/partition name has to update these columns. That is unnecessary and
> sometimes quite difficult to implement given the limitations of the DN and RawStore APIs.
> This patch removes the use of these redundant columns at the HMS code level. The changes
> include:
> 1. Instead of directly using these columns in TAB_COL_STATS and PART_COL_STATS, use the
> corresponding columns in their referenced tables.
> 2. Currently the CBO code assumes that the column stats returned from HMS are in the
> same order as the columns passed in the request. That is not guaranteed, and the code has
> been changed.
> 3. The deprecated redundant columns are now temporarily populated with the value "Deprecated".
> They will be removed in a follow-up JIRA.
> 
> 
> Diffs
> -----
> 
>   metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java 1f89b7c
>   metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 4d6bfcc
>   metastore/src/java/org/apache/hadoop/hive/metastore/StatObjectConverter.java b3ceff1
>   metastore/src/test/org/apache/hadoop/hive/metastore/VerifyingObjectStore.java 7e46523
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java 6c0bd25
> 
> Diff: https://reviews.apache.org/r/38429/diff/
> 
> 
> Testing
> -------
> 
> 1. Manually tested some cases against MySQL/PostgreSQL/Oracle.
> 2. The precommit tests are running.
> 
> 
> Thanks,
> 
> Chaoyu Tang
> 
>

