Date: Wed, 12 Nov 2014 01:36:33 +0000 (UTC)
From: "Aman Sinha (JIRA)"
To: drill-dev@incubator.apache.org
Subject: [jira] [Created] (DRILL-1691) ConvertCountToDirectScan rule should be applicable for 2 or more COUNT aggregates

Aman Sinha created DRILL-1691:
---------------------------------

             Summary: ConvertCountToDirectScan rule should be applicable for 2 or more COUNT aggregates
                 Key: DRILL-1691
                 URL: https://issues.apache.org/jira/browse/DRILL-1691
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
    Affects Versions: 0.6.0
            Reporter: Aman Sinha
            Assignee: Aman Sinha

The ConvertCountToDirectScan rule currently only applies if there is a single COUNT(*) or COUNT(column) aggregate without group-by.
This rule should be extended to apply to multiple such aggregates. The rule relies on the underlying ParquetGroupScan to supply the correct value count for a column, and retrieving that count for several columns should work just as well. However, if even one of those columns does not have statistics, the rule should not be applied.

Here's an example sequence:

First do a CTAS to ensure that statistics are present for the table (the original Parquet data may not have stats):

{code:sql}
0: jdbc:drill:zk=local> create table nation3 as select * from cp.`tpch/nation.parquet`;
+------------+---------------------------+
|  Fragment  | Number of records written |
+------------+---------------------------+
| 0_0        | 25                        |
+------------+---------------------------+
{code}

The Explain below shows that the count is retrieved directly from the Scan:

{code:sql}
0: jdbc:drill:zk=local> explain plan for select count(n_regionkey) as x from nation3;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(x=[$0])
00-02        Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@5db6cb92])
{code}

The following query, which computes 2 aggregates, causes an unnecessary StreamAgg to be introduced in the plan:

{code:sql}
0: jdbc:drill:zk=local> explain plan for select count(n_regionkey) as x, count(n_nationkey) as y from nation3;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(x=[$0], y=[$1])
00-02        StreamAgg(group=[{}], x=[COUNT($0)], y=[COUNT($1)])
00-03          Project(n_regionkey=[$1], n_nationkey=[$0])
00-04            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/nation3]], selectionRoot=/tmp/nation3, numFiles=1, columns=[`n_regionkey`, `n_nationkey`]]])
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
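
The applicability condition proposed in this issue (serve every COUNT from scan statistics, or none of them) could be sketched roughly as below. This is only an illustration under stated assumptions: the class `CountToDirectScanSketch`, the method `directCounts`, and the plain `Map` standing in for the per-column value counts reported by the group scan are all hypothetical names, not Drill's planner API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch; none of these names come from Drill's code base. */
final class CountToDirectScanSketch {

    /** Marker for COUNT(*), which uses the table row count instead of a column count. */
    static final String STAR = "*";

    /**
     * Returns one count per COUNT aggregate, in the order given, or an empty
     * Optional when the rule should not fire.
     *
     * @param countArgs         argument of each COUNT: a column name or STAR
     * @param rowCount          total row count reported by the scan (assumed available)
     * @param columnValueCounts value count for each column that has statistics
     */
    static Optional<List<Long>> directCounts(List<String> countArgs,
                                             long rowCount,
                                             Map<String, Long> columnValueCounts) {
        if (countArgs.isEmpty()) {
            return Optional.empty(); // nothing to convert
        }
        List<Long> counts = new ArrayList<>();
        for (String arg : countArgs) {
            if (STAR.equals(arg)) {
                counts.add(rowCount);
                continue;
            }
            Long valueCount = columnValueCounts.get(arg);
            if (valueCount == null) {
                // Even one column without statistics disables the rule entirely.
                return Optional.empty();
            }
            counts.add(valueCount);
        }
        return Optional.of(counts);
    }
}
```

With this shape, a query such as the two-aggregate example would yield a value for both `x` and `y` only when both `n_regionkey` and `n_nationkey` have statistics; otherwise the planner would keep the StreamAgg plan.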