Date: Tue, 10 Nov 2015 20:39:11 +0000 (UTC)
From: "Prasanth Jayachandran (JIRA)"
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Mailing-List: contact issues-help@hive.apache.org; run by ezmlm
Subject: [jira] [Commented] (HIVE-12309) TableScan should use column stats when available for better data size estimate

    [ https://issues.apache.org/jira/browse/HIVE-12309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14999307#comment-14999307 ]

Prasanth Jayachandran commented on HIVE-12309:
----------------------------------------------

Left a minor comment in RB. I am worried about the scenario of INCOMPLETE column stats. What happens if column stats are missing or stale? Raw data size will always be updated (if the appropriate configs are on and the file format supports it), but column stats freshness is not guaranteed. How do we deal with that in the estimation?

> TableScan should use column stats when available for better data size estimate
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-12309
>                 URL: https://issues.apache.org/jira/browse/HIVE-12309
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>            Reporter: Ashutosh Chauhan
>            Assignee: Ashutosh Chauhan
>         Attachments: HIVE-12309.2.patch, HIVE-12309.patch
>
>
> Currently, all other operators use column stats to figure out data size, whereas TableScan relies on rawDataSize. This inconsistency can result in TS having a lower data size than subsequent operators.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
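
One possible answer to the freshness question raised in the comment is to trust column stats only when they are known to be complete, and to fall back to rawDataSize otherwise. The Java sketch below illustrates that policy only. It is not the code in the attached patches, and every name in it (ScanSizeSketch, ColStatsState, ColStat, estimateDataSize) is hypothetical rather than part of Hive's API. It also assumes the data size can be approximated as row count times the sum of average column widths.

```java
import java.util.List;

/** Hypothetical sketch; none of these names are taken from Hive's code base. */
final class ScanSizeSketch {

    /** Assumed tri-state flag describing how much column-stat coverage exists. */
    enum ColStatsState { COMPLETE, PARTIAL, NONE }

    /** Illustrative per-column summary: average serialized width in bytes. */
    static final class ColStat {
        final String name;
        final double avgColLen;

        ColStat(String name, double avgColLen) {
            this.name = name;
            this.avgColLen = avgColLen;
        }
    }

    /**
     * Returns a data size estimate for a table scan.
     * Column stats are used only when they cover every needed column;
     * in all other cases the raw data size is returned unchanged.
     * Note: this check cannot detect stats that are complete but stale.
     */
    static long estimateDataSize(long numRows, long rawDataSize,
                                 ColStatsState state, List<ColStat> stats) {
        if (state != ColStatsState.COMPLETE || numRows <= 0) {
            return rawDataSize; // fallback: missing or partial column stats
        }
        double rowWidth = 0.0;
        for (ColStat c : stats) {
            rowWidth += c.avgColLen; // sum of average column widths
        }
        return Math.round(rowWidth * numRows);
    }
}
```

With complete stats the estimate follows the column widths; with PARTIAL or NONE it stays at rawDataSize, so the scan never reports a size derived from columns it has no stats for. Staleness would need a separate signal, which this sketch does not model.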