impala-issues mailing list archives

From "Alexander Behm (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (IMPALA-5096) Use parquet::Statistics for min/max aggregates when only a subset of scan columns have stats
Date Fri, 17 Mar 2017 22:23:41 GMT

     [ https://issues.apache.org/jira/browse/IMPALA-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Behm updated IMPALA-5096:
-----------------------------------
    Description: 
{code}
select min(int_col), max(timestamp_col) from parquet_table;
select min(int_col), max(int_col), max(timestamp_col) from parquet_table; <--- can we do this or do we have to bail?
{code}
If some columns do not have parquet::Statistics, then it is still possible to use the stats
of those columns that do have them, but with more effort. For those columns that have stats,
we can populate the scanner's template tuple with the stats values, and avoid scanning/materializing
those columns. We still need to scan the columns that do not have stats.
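
The split described above can be sketched as follows. This is a minimal illustration, not Impala's scanner code: `ColumnChunk`, `RowGroup`, and `min_max_aggregates` are hypothetical names, timestamps are stood in for by plain integers, and NULL handling is ignored. It also works per aggregate rather than per template-tuple slot, which sidesteps the question of how one slot would carry both min(int_col) and max(int_col).

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Set, Tuple


@dataclass
class ColumnChunk:
    """Hypothetical stand-in for one column chunk of a Parquet row group."""
    values: List[Any]                         # read only if the column is scanned
    stats: Optional[Tuple[Any, Any]] = None   # (min, max), or None if stats are absent


@dataclass
class RowGroup:
    columns: Dict[str, ColumnChunk]


def min_max_aggregates(
    row_groups: List[RowGroup],
    aggregates: List[Tuple[str, str]],
) -> Tuple[Dict[Tuple[str, str], Any], Set[str]]:
    """Evaluate (func, column) aggregates, func being "min" or "max".

    Columns with stats are answered from the stats alone; columns without
    stats are scanned. Returns the results and the set of scanned columns.
    """
    pick = {"min": min, "max": max}
    results: Dict[Tuple[str, str], Any] = {}
    scanned: Set[str] = set()
    for rg in row_groups:
        for func, col in aggregates:
            chunk = rg.columns[col]
            if chunk.stats is not None:
                # Stats available: take the value without touching the data.
                partial = chunk.stats[0] if func == "min" else chunk.stats[1]
            else:
                # No stats: fall back to scanning the column values.
                if not chunk.values:
                    continue
                partial = pick[func](chunk.values)
                scanned.add(col)
            key = (func, col)
            # Merge this row group's partial result into the running result.
            results[key] = partial if key not in results else pick[func](results[key], partial)
    return results, scanned


# int_col has stats in both row groups; timestamp_col has none.
row_groups = [
    RowGroup({"int_col": ColumnChunk([3, 1, 2], stats=(1, 3)),
              "timestamp_col": ColumnChunk([10, 30, 20])}),
    RowGroup({"int_col": ColumnChunk([7, 5], stats=(5, 7)),
              "timestamp_col": ColumnChunk([40, 15])}),
]
results, scanned = min_max_aggregates(
    row_groups,
    [("min", "int_col"), ("max", "int_col"), ("max", "timestamp_col")],
)
```

In this sketch only timestamp_col ends up in `scanned`; both int_col aggregates are answered from the per-row-group stats.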

Also consider how the various optimizations in IMPALA-4986 will interact. For example,
{code}
select min(string_col), count(*) from parquet_table
{code}
Can we still safely apply any of the optimizations?


  was:
{code}
select min(int_col), max(timestamp_col) from parquet_table;
{code}
If some columns do not have parquet::Statistics, then it is still possible to use the stats
of those columns that do have them, but with more effort. For those columns that have stats,
we can populate the scanner's template tuple with the stats values, and avoid scanning/materializing
those columns. We still need to scan the columns that do not have stats.

Also consider how the various optimizations in IMPALA-4986 will interact. For example,
{code}
select min(string_col), count(*) from parquet_table
{code}
Can we still safely apply any of the optimizations?



> Use parquet::Statistics for min/max aggregates when only a subset of scan columns have stats
> --------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-5096
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5096
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>    Affects Versions: Impala 2.8.0
>            Reporter: Alexander Behm
>              Labels: parquet, performance, ramp-up
>
> {code}
> select min(int_col), max(timestamp_col) from parquet_table;
> select min(int_col), max(int_col), max(timestamp_col) from parquet_table; <--- can we do this or do we have to bail?
> {code}
> If some columns do not have parquet::Statistics, then it is still possible to use the stats of those columns that do have them, but with more effort. For those columns that have stats, we can populate the scanner's template tuple with the stats values, and avoid scanning/materializing those columns. We still need to scan the columns that do not have stats.
> Also consider how the various optimizations in IMPALA-4986 will interact. For example,
> {code}
> select min(string_col), count(*) from parquet_table
> {code}
> Can we still safely apply any of the optimizations?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
