hive-issues mailing list archives

From "Alexander Kolbasov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.
Date Sun, 04 Mar 2018 19:11:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385293#comment-16385293 ]

Alexander Kolbasov commented on HIVE-18743:
-------------------------------------------

I noticed a bit of odd code:
{code:java}
public static void setBasicStatsState(Map<String, String> params, String setting) {
  ...
  ColumnStatsAccurate stats = parseStatsAcc(params.get(COLUMN_STATS_ACCURATE));
  stats.basicStats = true;
}{code}
So it parses the value of {{COLUMN_STATS_ACCURATE}} but then always ignores it and sets {{stats.basicStats}}
to true anyway. Is this intentional? Can this be removed?

> CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-18743
>                 URL: https://issues.apache.org/jira/browse/HIVE-18743
>             Project: Hive
>          Issue Type: Improvement
>          Components: Metastore
>    Affects Versions: 1.2.0, 1.1.0, 2.0.2, 3.0.0
>            Reporter: Alexander Behm
>            Assignee: Alexander Kolbasov
>            Priority: Major
>         Attachments: HIVE-18743.06.patch, HIVE-18743.07.patch
>
>
> When hive.stats.autogather=true, the Metastore lists all files under the table directory to populate basic stats like file counts and sizes. This file listing operation can be very expensive, particularly on filesystems like S3.
> One way to address this issue is to set hive.stats.autogather=false.
> *Here's the bug*
> It is my understanding that the DO_NOT_UPDATE_STATS table property is intended to selectively prevent this stats collection. Unfortunately, this table property is checked *after* the expensive file listing operation, so DO_NOT_UPDATE_STATS does not seem to work as intended. See:
> https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633
> Relevant code snippet:
> {code}
>   public static boolean updateTableStatsFast(Database db, Table tbl, Warehouse wh,
>                                              boolean madeDir, boolean forceRecompute,
>                                              EnvironmentContext environmentContext) throws MetaException {
>     if (tbl.getPartitionKeysSize() == 0) {
>       // Update stats only when unpartitioned
>       FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, tbl);
>       // DO_NOT_UPDATE_STATS is checked inside the call below, after
>       // wh.getFileStatusesForUnpartitionedTable() has already been called
>       return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute, environmentContext);
>     } else {
>       return false;
>     }
>   }
> {code}
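
The ordering problem described above can be illustrated with a minimal, self-contained sketch. It is not the Hive implementation or the attached patch. The class {{StatsCheckOrderSketch}}, the {{listFileSizes()}} stand-in for the expensive listing, and the {{numFiles}}/{{totalSize}} keys used for output are assumptions made for illustration. Only the {{DO_NOT_UPDATE_STATS}} property name comes from the issue text. The sketch shows the property being consulted before any listing takes place.

```java
import java.util.HashMap;
import java.util.Map;

public class StatsCheckOrderSketch {
    // Property name taken from the issue description.
    static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";

    // Counts how often the (hypothetical) expensive listing was invoked.
    static int listingCalls = 0;

    // Hypothetical stand-in for an expensive file listing, e.g. on S3.
    static long[] listFileSizes() {
        listingCalls++;
        return new long[] {128L, 256L};
    }

    // Checks the table property first and only lists files when stats
    // collection is actually wanted. Returns true if stats were updated.
    static boolean updateStatsIfAllowed(Map<String, String> tableParams) {
        // Boolean.parseBoolean(null) is false, so a missing property means "update".
        if (Boolean.parseBoolean(tableParams.get(DO_NOT_UPDATE_STATS))) {
            return false; // bail out before any file listing happens
        }
        long[] sizes = listFileSizes();
        long total = 0;
        for (long size : sizes) {
            total += size;
        }
        // Illustrative output keys, assumed for this sketch.
        tableParams.put("numFiles", String.valueOf(sizes.length));
        tableParams.put("totalSize", String.valueOf(total));
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> skipped = new HashMap<>();
        skipped.put(DO_NOT_UPDATE_STATS, "true");
        System.out.println("updated=" + updateStatsIfAllowed(skipped)
            + " listingCalls=" + listingCalls);

        Map<String, String> normal = new HashMap<>();
        System.out.println("updated=" + updateStatsIfAllowed(normal)
            + " listingCalls=" + listingCalls);
    }
}
```

With the check placed first, a table that sets the property never pays for the listing. In the snippet quoted above, the listing runs unconditionally before the property is examined.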



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
