From: "David Mollitor (Jira)"
To: dev@hive.apache.org
Date: Thu, 19 Mar 2020 13:47:00 +0000 (UTC)
Subject: [jira] [Created] (HIVE-23054) Capture Total Byte Size in Column Statistics

David Mollitor created HIVE-23054:
-------------------------------------

             Summary: Capture Total Byte Size in Column Statistics
                 Key: HIVE-23054
                 URL: https://issues.apache.org/jira/browse/HIVE-23054
             Project: Hive
          Issue Type: Improvement
          Components: CBO, Statistics
            Reporter: 
David Mollitor

Store a counter in the HMS column statistics for the total number of raw bytes in each column.

Right now, there is no good way to merge the average column length when an INSERT statement adds data to a table. The code simply selects the maximum of the two averages. However, if a single record with a long length (128 bytes) is inserted into a table that has millions of strings with an average length of 4, the average length for the entire data set gets boosted to 128.

{code:java}
aggregateData.setAvgColLen(Math.max(aggregateData.getAvgColLen(), newData.getAvgColLen()));
{code}

https://github.com/apache/hive/blob/e182d9ce6c09136d13ee889ef069b202f60052ec/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/StringColumnStatsMerger.java#L34

Store the total raw size of all the data in each column. Given the total raw size and the average length of both the existing data and the newly inserted data, one can compute the real average length when merging the two: the number of values on each side is its total size divided by its average length, so the merged average is the combined size divided by the combined count.
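
To make the proposed merge concrete, here is a minimal sketch of the arithmetic, using the 128-byte example from the description. It is not the metastore code: the class `AvgColLenMergeSketch`, the `ColStats` holder, and its `totalBytes` field are hypothetical names invented for illustration, and the sketch assumes the value count is derived from the two stored numbers rather than stored separately.

```java
// Hypothetical sketch: these names are illustrative and are NOT part of
// Hive's metastore API. It only demonstrates the arithmetic of the proposal.
final class AvgColLenMergeSketch {

    /** Minimal stand-in for the per-column statistics of a string column. */
    static final class ColStats {
        final double avgColLen; // average value length, in bytes
        final long totalBytes;  // proposed counter: total raw bytes in the column

        ColStats(double avgColLen, long totalBytes) {
            this.avgColLen = avgColLen;
            this.totalBytes = totalBytes;
        }

        /** Value count implied by the two stored numbers (0 for an empty column). */
        double impliedCount() {
            return avgColLen > 0 ? totalBytes / avgColLen : 0.0;
        }
    }

    /** Merges two sets of statistics into a size-weighted average length. */
    static ColStats merge(ColStats existing, ColStats inserted) {
        // addExact throws ArithmeticException on overflow instead of wrapping.
        long totalBytes = Math.addExact(existing.totalBytes, inserted.totalBytes);
        double count = existing.impliedCount() + inserted.impliedCount();
        double avg = count > 0 ? totalBytes / count : 0.0;
        return new ColStats(avg, totalBytes);
    }

    public static void main(String[] args) {
        // One million strings averaging 4 bytes, plus one 128-byte record.
        ColStats existing = new ColStats(4.0, 4_000_000L);
        ColStats inserted = new ColStats(128.0, 128L);
        ColStats merged = merge(existing, inserted);

        System.out.println("max-based average:  "
            + Math.max(existing.avgColLen, inserted.avgColLen));
        System.out.println("byte-based average: " + merged.avgColLen);
    }
}
```

With these inputs the max-based merge reports 128, while the byte-based merge stays very close to 4, because the single long record contributes only 128 of roughly four million bytes.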