Return-Path: X-Original-To: apmail-drill-issues-archive@minotaur.apache.org Delivered-To: apmail-drill-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D75DAC850 for ; Wed, 12 Nov 2014 19:46:59 +0000 (UTC) Received: (qmail 40701 invoked by uid 500); 12 Nov 2014 19:46:59 -0000 Delivered-To: apmail-drill-issues-archive@drill.apache.org Received: (qmail 40538 invoked by uid 500); 12 Nov 2014 19:46:59 -0000 Mailing-List: contact issues-help@drill.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@drill.incubator.apache.org Delivered-To: mailing list issues@drill.incubator.apache.org Received: (qmail 40523 invoked by uid 99); 12 Nov 2014 19:46:59 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 12 Nov 2014 19:46:59 +0000 X-ASF-Spam-Status: No, hits=-2000.6 required=5.0 tests=ALL_TRUSTED,RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.3] (HELO mail.apache.org) (140.211.11.3) by apache.org (qpsmtpd/0.29) with SMTP; Wed, 12 Nov 2014 19:46:36 +0000 Received: (qmail 39652 invoked by uid 99); 12 Nov 2014 19:46:34 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 12 Nov 2014 19:46:34 +0000 Date: Wed, 12 Nov 2014 19:46:34 +0000 (UTC) From: "Aman Sinha (JIRA)" To: issues@drill.incubator.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Updated] (DRILL-1434) count of a nullable column in tpcds gives incorrect results MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/DRILL-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Sinha updated DRILL-1434: ------------------------------ Attachment: 0001-DRILL-1434-In-ParquetGroupScan-compute-the-non-null-.patch Updated patch after removing 1 extraneous comment line. Regarding the accuracy of the statistics, the assumption is that the total count, null count and min/max values are accurate and not based on sampling. > count of a nullable column in tpcds gives incorrect results > ----------------------------------------------------------- > > Key: DRILL-1434 > URL: https://issues.apache.org/jira/browse/DRILL-1434 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill > Affects Versions: 0.6.0 > Reporter: Chun Chang > Assignee: Aman Sinha > Attachments: 0001-DRILL-1434-In-ParquetGroupScan-compute-the-non-null-.patch > > > code base > #Fri Sep 12 14:08:02 PDT 2014 > git.commit.id.abbrev=9e16466 > I have a parquet file (tpcds data) which contains null value on a column. The total count of the column: > 0: jdbc:drill:schema=dfs> select count(ss_quantity) from `tpcds/p1/store_sales.parquet`; > +------------+ > | EXPR$0 | > +------------+ > | 2880404 | > +------------+ > The count without considering null is: > 0: jdbc:drill:schema=dfs> select count(ss_quantity) from `tpcds/p1/store_sales.parquet` where ss_quantity is not null; > +------------+ > | EXPR$0 | > +------------+ > | 2750408 | > +------------+ > But the count for null value is zero: > 0: jdbc:drill:schema=dfs> select count(ss_quantity) from `tpcds/p1/store_sales.parquet` where ss_quantity is null; > +------------+ > | EXPR$0 | > +------------+ > | 0 | > +------------+ > Here is the physical plan look like for this query: > 0: jdbc:drill:schema=dfs> explain plan for select count(ss_quantity) from `tpcds/p1/store_sales.parquet` where ss_quantity is null; > +------------+------------+ > | text | json | > +------------+------------+ > | 00-00 Screen > 00-01 StreamAgg(group=[{}], EXPR$0=[COUNT($0)]) > 00-02 Filter(condition=[IS NULL($0)]) > 00-03 ProducerConsumer > 00-04 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/user/root/mondrian/tpcds/p1/store_sales.parquet]], selectionRoot=/user/root/mondrian/tpcds/p1/store_sales.parquet, columns=[SchemaPath [`ss_quantity`]]]]) > | { > "head" : { > "version" : 1, > "generator" : { > "type" : "ExplainHandler", > "info" : "" > }, > "type" : "APACHE_DRILL_PHYSICAL", > "options" : [ ], > "queue" : 0, > "resultMode" : "EXEC" > }, > "graph" : [ { > "pop" : "parquet-scan", > "@id" : 4, > "entries" : [ { > "path" : "maprfs:/user/root/mondrian/tpcds/p1/store_sales.parquet" > } ], > "storage" : { > "type" : "file", > "enabled" : true, > "connection" : "maprfs:///", > "workspaces" : { > "default" : { > "location" : "/user/root/mondrian/", > "writable" : true, > "storageformat" : null > }, > "home" : { > "location" : "/", > "writable" : false, > "storageformat" : null > }, > "root" : { > "location" : "/", > "writable" : false, > "storageformat" : null > }, > "abhi" : { > "location" : "/tables", > "writable" : true, > "storageformat" : "csv" > }, > "chun" : { > "location" : "/drill/testdata/chun/", > "writable" : false, > "storageformat" : null > }, > "tmp" : { > "location" : "/tmp", > "writable" : true, > "storageformat" : "csv" > } > }, > "formats" : { > "psv" : { > "type" : "text", > "extensions" : [ "tbl" ], > "delimiter" : "|" > }, > "csv" : { > "type" : "text", > "extensions" : [ "csv" ], > "delimiter" : "," > }, > "tsv" : { > "type" : "text", > "extensions" : [ "tsv" ], > "delimiter" : "\t" > }, > "parquet" : { > "type" : "parquet" > }, > "json" : { > "type" : "json" > } > } > }, > "format" : { > "type" : "parquet" > }, > "columns" : [ "`ss_quantity`" ], > "selectionRoot" : "/user/root/mondrian/tpcds/p1/store_sales.parquet", > "cost" : 2880404.0 > }, { > "pop" : "producer-consumer", > "@id" : 3, > "child" : 4, > "size" : 10, > "initialAllocation" : 1000000, > "maxAllocation" : 10000000000, > "cost" : 2880404.0 > }, { > "pop" : "filter", > "@id" : 2, > "child" : 3, > "expr" : "isnull(`ss_quantity`) ", > "initialAllocation" : 1000000, > "maxAllocation" : 10000000000, > "cost" : 720101.0 > }, { > "pop" : "streaming-aggregate", > "@id" : 1, > "child" : 2, > "keys" : [ ], > "exprs" : [ { > "ref" : "`EXPR$0`", > "expr" : "count(`ss_quantity`) " > } ], > "initialAllocation" : 1000000, > "maxAllocation" : 10000000000, > "cost" : 1.0 > }, { > "pop" : "screen", > "@id" : 0, > "child" : 1, > "initialAllocation" : 1000000, > "maxAllocation" : 10000000000, > "cost" : 72010.1 > } ] > } | > +------------+------------+ -- This message was sent by Atlassian JIRA (v6.3.4#6332)