drill-issues mailing list archives

From "Jacques Nadeau (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-1434) count of a nullable column in tpcds gives incorrect results
Date Sun, 09 Nov 2014 21:29:33 GMT

    [ https://issues.apache.org/jira/browse/DRILL-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204123#comment-14204123 ]

Jacques Nadeau commented on DRILL-1434:
---------------------------------------

I'm confused by what you said.  It sounds like the issue is simply that, when trying to
determine a count, we should consult the number of non-null column values rather than the
total number of values in the parquet file.  If the statistics are available, return that
count; otherwise the rule should not match.  It looks like we should update ParquetGroupScan
to check ColumnChunkMetaData.getStatistics().getNumNulls() and subtract that from the total
value count (if the stats are available).
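
A rough sketch of that arithmetic, not the actual ParquetGroupScan code: ChunkStats and
nonNullCount are hypothetical names standing in for the per-column-chunk footer metadata,
and the null count of 129996 is only inferred from the two counts in the report below
(2880404 - 2750408).  If any chunk lacks statistics, the sketch returns empty, which is the
case where the rule should not match.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;

public class NonNullCountSketch {

    // Hypothetical stand-in for the footer metadata of one column chunk.
    static final class ChunkStats {
        final long valueCount;        // total values in the chunk, nulls included
        final OptionalLong numNulls;  // empty when the writer recorded no statistics

        ChunkStats(long valueCount, OptionalLong numNulls) {
            this.valueCount = valueCount;
            this.numNulls = numNulls;
        }
    }

    // Sums (valueCount - numNulls) over all chunks of a column.
    // Returns empty if any chunk has no null count, so the caller can
    // fall back to a real scan instead of answering from metadata.
    static OptionalLong nonNullCount(List<ChunkStats> chunks) {
        long total = 0;
        for (ChunkStats chunk : chunks) {
            if (!chunk.numNulls.isPresent()) {
                return OptionalLong.empty();
            }
            total += chunk.valueCount - chunk.numNulls.getAsLong();
        }
        return OptionalLong.of(total);
    }

    public static void main(String[] args) {
        List<ChunkStats> withStats =
            Arrays.asList(new ChunkStats(2880404L, OptionalLong.of(129996L)));
        List<ChunkStats> withoutStats =
            Arrays.asList(new ChunkStats(2880404L, OptionalLong.empty()));

        System.out.println(nonNullCount(withStats));                // OptionalLong[2750408]
        System.out.println(nonNullCount(withoutStats).isPresent()); // false
    }
}
```

Returning an empty result rather than the raw value count keeps the metadata shortcut from
ever reporting a count that includes nulls.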

> count of a nullable column in tpcds gives incorrect results
> -----------------------------------------------------------
>
>                 Key: DRILL-1434
>                 URL: https://issues.apache.org/jira/browse/DRILL-1434
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 0.6.0
>            Reporter: Chun Chang
>            Assignee: Aman Sinha
>
> code base 
> #Fri Sep 12 14:08:02 PDT 2014
> git.commit.id.abbrev=9e16466
> I have a parquet file (tpcds data) which contains null values in a column. The total count
> of the column:
> 0: jdbc:drill:schema=dfs> select count(ss_quantity) from `tpcds/p1/store_sales.parquet`;
> +------------+
> |   EXPR$0   |
> +------------+
> | 2880404    |
> +------------+
> The count without considering null is:
> 0: jdbc:drill:schema=dfs> select count(ss_quantity) from `tpcds/p1/store_sales.parquet` where ss_quantity is not null;
> +------------+
> |   EXPR$0   |
> +------------+
> | 2750408    |
> +------------+
> But the count for null values is zero:
> 0: jdbc:drill:schema=dfs> select count(ss_quantity) from `tpcds/p1/store_sales.parquet` where ss_quantity is null;
> +------------+
> |   EXPR$0   |
> +------------+
> | 0          |
> +------------+
> Here is what the physical plan looks like for this query:
> 0: jdbc:drill:schema=dfs> explain plan for select count(ss_quantity) from `tpcds/p1/store_sales.parquet` where ss_quantity is null;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      StreamAgg(group=[{}], EXPR$0=[COUNT($0)])
> 00-02        Filter(condition=[IS NULL($0)])
> 00-03          ProducerConsumer
> 00-04            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/user/root/mondrian/tpcds/p1/store_sales.parquet]], selectionRoot=/user/root/mondrian/tpcds/p1/store_sales.parquet, columns=[SchemaPath [`ss_quantity`]]]])
>  | {
>   "head" : {
>     "version" : 1,
>     "generator" : {
>       "type" : "ExplainHandler",
>       "info" : ""
>     },
>     "type" : "APACHE_DRILL_PHYSICAL",
>     "options" : [ ],
>     "queue" : 0,
>     "resultMode" : "EXEC"
>   },
>   "graph" : [ {
>     "pop" : "parquet-scan",
>     "@id" : 4,
>     "entries" : [ {
>       "path" : "maprfs:/user/root/mondrian/tpcds/p1/store_sales.parquet"
>     } ],
>     "storage" : {
>       "type" : "file",
>       "enabled" : true,
>       "connection" : "maprfs:///",
>       "workspaces" : {
>         "default" : {
>           "location" : "/user/root/mondrian/",
>           "writable" : true,
>           "storageformat" : null
>         },
>         "home" : {
>           "location" : "/",
>           "writable" : false,
>           "storageformat" : null
>         },
>         "root" : {
>           "location" : "/",
>           "writable" : false,
>           "storageformat" : null
>         },
>         "abhi" : {
>           "location" : "/tables",
>           "writable" : true,
>           "storageformat" : "csv"
>         },
>         "chun" : {
>           "location" : "/drill/testdata/chun/",
>           "writable" : false,
>           "storageformat" : null
>         },
>         "tmp" : {
>           "location" : "/tmp",
>           "writable" : true,
>           "storageformat" : "csv"
>         }
>       },
>       "formats" : {
>         "psv" : {
>           "type" : "text",
>           "extensions" : [ "tbl" ],
>           "delimiter" : "|"
>         },
>         "csv" : {
>           "type" : "text",
>           "extensions" : [ "csv" ],
>           "delimiter" : ","
>         },
>         "tsv" : {
>           "type" : "text",
>           "extensions" : [ "tsv" ],
>           "delimiter" : "\t"
>         },
>         "parquet" : {
>           "type" : "parquet"
>         },
>         "json" : {
>           "type" : "json"
>         }
>       }
>     },
>     "format" : {
>       "type" : "parquet"
>     },
>     "columns" : [ "`ss_quantity`" ],
>     "selectionRoot" : "/user/root/mondrian/tpcds/p1/store_sales.parquet",
>     "cost" : 2880404.0
>   }, {
>     "pop" : "producer-consumer",
>     "@id" : 3,
>     "child" : 4,
>     "size" : 10,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 2880404.0
>   }, {
>     "pop" : "filter",
>     "@id" : 2,
>     "child" : 3,
>     "expr" : "isnull(`ss_quantity`) ",
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 720101.0
>   }, {
>     "pop" : "streaming-aggregate",
>     "@id" : 1,
>     "child" : 2,
>     "keys" : [ ],
>     "exprs" : [ {
>       "ref" : "`EXPR$0`",
>       "expr" : "count(`ss_quantity`) "
>     } ],
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 1.0
>   }, {
>     "pop" : "screen",
>     "@id" : 0,
>     "child" : 1,
>     "initialAllocation" : 1000000,
>     "maxAllocation" : 10000000000,
>     "cost" : 72010.1
>   } ]
> } |
> +------------+------------+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
