Mailing-List: contact issues-help@impala.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@impala.incubator.apache.org
Date: Fri, 17 Mar 2017 21:55:41 +0000 (UTC)
From: "Alexander Behm (JIRA)" <jira@apache.org>
To: issues@impala.incubator.apache.org
Message-ID: <JIRA.13057146.1489781980000.54231.1489787741534@Atlassian.JIRA>
In-Reply-To: <JIRA.13057146.1489781980000@Atlassian.JIRA>
References: <JIRA.13057146.1489781980000@Atlassian.JIRA> <JIRA.13057146.1489781980470@jira-lw-us.apache.org>
Subject: [jira] [Updated] (IMPALA-5095) Use parquet::Statistics for simple
 min/max aggregates
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Fri, 17 Mar 2017 21:55:47 -0000


     [ https://issues.apache.org/jira/browse/IMPALA-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Behm updated IMPALA-5095:
-----------------------------------
    Description: 
{code}
select min(int_col), max(bigint_col) from parquet_table;
select min(int_col), max(bigint_col) from parquet_table group by partition_col;
select min(int_col), max(int_col) from parquet_table group by partition_col; -- slightly trickier: int_col is referenced twice
{code}

The slot values for int_col and bigint_col can be directly filled in from the parquet::Statistics, assuming stats are available for both columns. No columns need to be scanned/materialized.
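
As a rough illustration of the idea, the sketch below derives per-partition min/max values purely from per-row-group statistics, without touching any column data. It is not Impala code: `ColumnStats`, `RowGroup`, and `min_max_from_stats` are hypothetical names standing in for the min/max fields of parquet::Statistics and the scanner logic, and it assumes every row group carries statistics for the column.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical stand-in for the min/max fields of parquet::Statistics."""
    min_value: int
    max_value: int


@dataclass(frozen=True)
class RowGroup:
    """Hypothetical row-group metadata: its partition and per-column stats."""
    partition: str
    stats: Dict[str, ColumnStats]


def min_max_from_stats(row_groups: Iterable[RowGroup],
                       column: str) -> Dict[str, Tuple[int, int]]:
    """Return {partition: (min, max)} for `column` using only statistics.

    Assumes every row group has statistics for `column`; a missing entry
    raises KeyError, which a caller could treat as "fall back to a scan".
    """
    result: Dict[str, Tuple[int, int]] = {}
    for rg in row_groups:
        s = rg.stats[column]
        if rg.partition in result:
            lo, hi = result[rg.partition]
            result[rg.partition] = (min(lo, s.min_value), max(hi, s.max_value))
        else:
            result[rg.partition] = (s.min_value, s.max_value)
    return result
```

Because one pass yields both the minimum and the maximum of a column, a query that references the same column in both min() and max() needs no extra work in this sketch.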

This JIRA focuses on implementing this optimization in the simple case where all scanned columns feed into min/max aggregates and where all columns have parquet::Statistics. Those conditions can be relaxed, but should be addressed separately.

This optimization opportunity must be detected by the planner and is not applicable when there are scan predicates.
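
The eligibility conditions described here could be checked along the following lines. This is only a sketch under stated assumptions: `ScanInfo` and `can_answer_from_stats` are hypothetical names, not part of the Impala planner, and the model assumes that partition columns used in GROUP BY are not counted among the scanned columns.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class ScanInfo:
    """Hypothetical summary of what the planner knows about a scan."""
    scanned_columns: FrozenSet[str]          # non-partition columns read by the scan
    agg_exprs: Tuple[Tuple[str, str], ...]   # (aggregate function, column) pairs
    columns_with_stats: FrozenSet[str]       # columns that have statistics
    has_scan_predicates: bool


def can_answer_from_stats(scan: ScanInfo) -> bool:
    """Return True only in the simple case described above."""
    # Scan predicates rule the optimization out entirely.
    if scan.has_scan_predicates:
        return False
    # Only min/max aggregates qualify.
    if any(func not in ("min", "max") for func, _ in scan.agg_exprs):
        return False
    # Every scanned column must feed into one of those aggregates.
    agg_columns = {col for _, col in scan.agg_exprs}
    if not scan.scanned_columns <= agg_columns:
        return False
    # Every scanned column must have statistics available.
    return scan.scanned_columns <= scan.columns_with_stats
```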


  was:
{code}
select min(int_col), max(bigint_col) from parquet_table;
select min(int_col), max(bigint_col) from parquet_table group by partition_col;
{code}

The slot values for int_col and bigint_col can be directly filled in from the parquet::Statistics, assuming stats are available for both columns. No columns need to be scanned/materialized.

This JIRA focuses on implementing this optimization in the simple case where all scanned columns feed into min/max aggregates and where all columns have parquet::Statistics. Those conditions can be relaxed, but should be addressed separately.

This optimization opportunity must be detected by the planner and is not applicable when there are scan predicates.


> Use parquet::Statistics for simple min/max aggregates
> -----------------------------------------------------
>
>                 Key: IMPALA-5095
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5095
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>    Affects Versions: Impala 2.8.0
>            Reporter: Alexander Behm
>              Labels: parquet, perfomance, ramp-up
>
> {code}
> select min(int_col), max(bigint_col) from parquet_table;
> select min(int_col), max(bigint_col) from parquet_table group by partition_col;
> select min(int_col), max(int_col) from parquet_table group by partition_col; -- slightly trickier: int_col is referenced twice
> {code}
> The slot values for int_col and bigint_col can be directly filled in from the parquet::Statistics, assuming stats are available for both columns. No columns need to be scanned/materialized.
> This JIRA focuses on implementing this optimization in the simple case where all scanned columns feed into min/max aggregates and where all columns have parquet::Statistics. Those conditions can be relaxed, but should be addressed separately.
> This optimization opportunity must be detected by the planner and is not applicable when there are scan predicates.

