Date: Wed, 23 May 2018 00:56:00 +0000 (UTC)
From: "Sergey Shelukhin (JIRA)"
To: issues@hive.apache.org
Subject: [jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

    [ https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486532#comment-16486532 ]

Sergey Shelukhin commented on HIVE-19418:
-----------------------------------------

Rebased the patch (no conflicts, just some offset changes) to run HiveQA again

> add background stats updater similar to compactor
> -------------------------------------------------
>
>                 Key: HIVE-19418
>                 URL: https://issues.apache.org/jira/browse/HIVE-19418
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>         Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch, HIVE-19418.patch
>
>
> There's a JIRA HIVE-19416 to add snapshot version to stats for MM/ACID tables to make them usable in a transaction without breaking ACID (for metadata-only optimization). However, stats for ACID tables can still become unusable if e.g. two parallel inserts run - neither sees the data written by the other, so after both finish, the snapshots on either set of stats won't match the current snapshot and the stats will be unusable.
> Additionally, for ACID and non-ACID tables alike, a lot of the stats, with some exceptions like numRows, cannot be aggregated (i.e. you cannot combine ndvs from two inserts), and for ACID even less can be aggregated (you cannot derive min/max if some rows are deleted but you don't scan the rest of the dataset).
> Therefore we will add background logic to metastore (similar to, and partially inside, the ACID compactor) to update stats.
> It will have 3 modes of operation.
> 1) Off.
> 2) Update only the stats that exist but are out of date (generating stats can be expensive, so if the user is only analyzing a subset of tables it should be able to only update that subset). We can simply look at existing stats and only analyze for the relevant partitions and columns.
> 3) On: 2 + create stats for all tables and columns missing stats.
> There will also be a table parameter to skip stats update.
> In phase 1, the process will operate outside of compactor, and run analyze command on the table. The analyze command will automatically save the stats with ACID snapshot information if needed, based on HIVE-19416, so we don't need to do any special state management and this will work for all table types. However it's also more expensive.
> In phase 2, we can explore adding stats collection during MM compaction that uses a temp table. If we don't have open writers during major compaction (so we overwrite all of the data), the temp table stats can simply be copied over to the main table with correct snapshot information, saving us a table scan.
> In phase 3, we can add custom stats collection logic to full ACID compactor that is not query based, the same way as we'd do for (2). Alternatively we can wait for ACID compactor to become query based and just reuse (2).

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
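
The three modes of operation and the per-table skip parameter described in the issue can be sketched roughly as follows. This is only an illustration in Python, not code from the patch: the names `StatsUpdateMode`, `TableInfo`, `stats_valid`, `skip_stats_update` and `columns_to_analyze` are all hypothetical, and the sketch works at column level only, leaving out partitions and the snapshot checks from HIVE-19416.

```python
from dataclasses import dataclass, field
from enum import Enum


class StatsUpdateMode(Enum):
    """Hypothetical names for the three modes listed in the issue."""
    OFF = "off"            # 1) never update stats in the background
    EXISTING = "existing"  # 2) refresh only stats that exist but are out of date
    ALL = "all"            # 3) mode 2 plus create stats that are missing


@dataclass
class TableInfo:
    """Simplified, assumed view of what the updater knows about a table."""
    name: str
    columns: list
    # column -> True if its existing stats are still valid, False if stale;
    # a column absent from this dict has no stats at all.
    stats_valid: dict = field(default_factory=dict)
    # Stand-in for the table parameter that opts a table out of updates.
    skip_stats_update: bool = False


def columns_to_analyze(table, mode):
    """Return the columns a background updater would re-analyze, in column order."""
    if mode is StatsUpdateMode.OFF or table.skip_stats_update:
        return []
    wanted = {c for c in table.columns if table.stats_valid.get(c) is False}
    if mode is StatsUpdateMode.ALL:
        # Mode 3 additionally covers columns that have no stats yet.
        wanted |= {c for c in table.columns if c not in table.stats_valid}
    return [c for c in table.columns if c in wanted]


if __name__ == "__main__":
    t = TableInfo("t", ["a", "b", "c"], {"a": True, "b": False})
    print(columns_to_analyze(t, StatsUpdateMode.EXISTING))  # ['b']
    print(columns_to_analyze(t, StatsUpdateMode.ALL))       # ['b', 'c']
```

In this sketch, mode 2 never touches column `c` because it has no stats yet, which mirrors the stated goal of not generating expensive stats the user never asked for.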