drill-dev mailing list archives

From Andries Engelbrecht <aengelbre...@maprtech.com>
Subject Re: Limit the number of output parquet files in CTAS
Date Mon, 31 Oct 2016 20:24:48 GMT
You can try setting store.partition.hash_distribute to true, but it is still listed as an
alpha feature.
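
As a rough sketch of what that might look like: the option is documented as applying to CTAS
statements that use PARTITION BY, so the example assumes a partitioned table. The workspace
dfs.tmp, the source path, and the column names are hypothetical placeholders, not taken from
the original query.

```sql
-- Enable hash distribution on the partition keys for this session only
-- (alpha option; behavior may change between Drill releases).
ALTER SESSION SET `store.partition.hash_distribute` = true;

-- Hypothetical partitioned CTAS; table, path, and columns are assumptions.
CREATE TABLE dfs.tmp.`report_partitioned`
PARTITION BY (report_date)
AS
SELECT report_date, customer_id, amount
FROM dfs.`/data/source_table`;
```

With the option on, rows sharing a partition key should be routed to the same writer, which
tends to reduce the number of files per partition. How many files result still depends on
the data and the cluster.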

You can also add a sort operation (order by) to the CTAS statement to force a single data
stream at output. I believe this was discussed a while back on the user list.
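
A minimal sketch of the ORDER BY approach, again with a hypothetical workspace, source path,
and sort column:

```sql
-- The scan can still run in parallel; the final ordered merge feeds the writer
-- as a single stream. Table, path, and column names are assumptions.
CREATE TABLE dfs.tmp.`report_sorted` AS
SELECT report_date, customer_id, amount
FROM dfs.`/data/source_table`
ORDER BY report_date;
```

The trade-off is the cost of sorting the full result, so this is probably only worthwhile
when the output is small relative to the data scanned.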

Ideally you want to look at the data set size and how much parallelism would work best in
your environment for reading the data later.

--Andries


> On Oct 31, 2016, at 12:57 PM, François Méthot <fmethot78@gmail.com> wrote:
> 
> Hi,
> 
> Is there a way to limit the number of files produced by a CTAS query?
> I would like the speed benefits of having hundreds of scanner fragments but
> don't want to deal with hundreds of output files.
> 
> Our use case right now is using 880 threads to scan and produce a report
> output spread over... 880 parquet files.
> Each resulting file is ~7M.
> 
> The only way I found to reduce those files to a smaller set is to perform a
> second CTAS query on the aggregated files with planner.width.max_per_query
> set to a smaller number.
> 
> Any possible way to do this in one query?
> 
> Thanks
> Francois

