drill-issues mailing list archives

From "Miroslav Holubec (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-4601) Partitioning based on the parquet statistics
Date Wed, 13 Apr 2016 09:03:25 GMT

     [ https://issues.apache.org/jira/browse/DRILL-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Miroslav Holubec updated DRILL-4601:
------------------------------------
    Description: 
Extending the partitioning idea implemented in DRILL-3333 even further could noticeably
improve performance.
Currently, partitioning relies on statistics only when a column's min value equals its max
value for the whole file. Based on this, files are removed from the scan in the planning phase.
The problem is that this leads to many small parquet files, which is undesirable on HDFS.
Also, only a few columns are partitioned.

I would like to extend this idea to use the statistics of all columns. If a filter requires
a column to equal a constant, all files whose statistics rule out that constant (it falls
outside the file's min/max range) can be removed from the plan. This should significantly
improve performance for scans over many parquet files.
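
As a rough illustration of the idea above, here is a minimal Java sketch of min/max pruning
for an equality filter. It is not the attached patch and does not use Drill's planner or
metadata classes. The names ColumnStats, isSingleValue, mayContain and prune are hypothetical,
and the statistics are assumed to be plain long values for one column.

```java
import java.util.ArrayList;
import java.util.List;

public class StatsPruningSketch {

    /** Hypothetical per-file statistics for one column; not a Drill class. */
    static final class ColumnStats {
        final String file;
        final long min;
        final long max;

        ColumnStats(String file, long min, long max) {
            this.file = file;
            this.min = min;
            this.max = max;
        }
    }

    /** Current rule as described above: the whole file holds one value for the column. */
    static boolean isSingleValue(ColumnStats s) {
        return s.min == s.max;
    }

    /** Proposed rule: rows with column = constant can exist only if min <= constant <= max. */
    static boolean mayContain(ColumnStats s, long constant) {
        return s.min <= constant && constant <= s.max;
    }

    /** Keeps only the files that cannot be ruled out for the filter column = constant. */
    static List<String> prune(List<ColumnStats> files, long constant) {
        List<String> kept = new ArrayList<>();
        for (ColumnStats s : files) {
            if (mayContain(s, constant)) {
                kept.add(s.file);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<ColumnStats> files = new ArrayList<>();
        files.add(new ColumnStats("a.parquet", 1, 10));   // range excludes 15
        files.add(new ColumnStats("b.parquet", 11, 20));  // range includes 15
        files.add(new ColumnStats("c.parquet", 15, 15));  // single-value file
        System.out.println(prune(files, 15)); // prints [b.parquet, c.parquet]
    }
}
```

The range check covers the single-value case (min equal to max) as well, so larger files
with wider ranges can still be skipped. A file with no statistics for the column cannot be
ruled out and would have to stay in the plan.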

I have an initial patch ready, currently just to give an idea. (It changes metadata v2, which
is not ideal, and it currently supports only the equality operator.)

  was:
Extending the partitioning idea implemented in DRILL-3333 even further could noticeably
improve performance.
Currently, partitioning relies on statistics only when a column's min value equals its max
value for the whole file. Based on this, files are removed from the scan in the planning phase.
The problem with this is that it leads to many small parquet files, which is undesirable on
HDFS. Also, only a few columns are partitioned.

I would like to extend this idea to use the statistics of all columns. If a filter requires
a column to equal a constant, all files whose statistics rule out that constant (it falls
outside the file's min/max range) can be removed from the plan. This should significantly
improve performance for scans over many parquet files.

I have an initial patch ready, currently just to give an idea (it reuses metadata v2).


> Partitioning based on the parquet statistics
> --------------------------------------------
>
>                 Key: DRILL-4601
>                 URL: https://issues.apache.org/jira/browse/DRILL-4601
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>            Reporter: Miroslav Holubec
>              Labels: parquet, partitioning, planning, statistics
>         Attachments: DRILL-4601.1.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
