drill-dev mailing list archives

From François Méthot <fmetho...@gmail.com>
Subject Re: Limit the number of output parquet files in CTAS
Date Wed, 02 Nov 2016 00:24:14 GMT
Thanks Andries,

I experimented with the ORDER BY and it works as you mentioned.
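
For the archive, here is a minimal sketch of what that looks like. The workspace, path, table, and column names are hypothetical placeholders, not our real ones, and I am assuming the default Parquet output format:

```sql
-- Hypothetical names throughout: dfs.tmp, report_sorted, /data/events,
-- and the three columns are placeholders for illustration only.
ALTER SESSION SET `store.format` = 'parquet';

-- The scan can still run with many fragments; the ORDER BY merges the
-- result into a single stream before the writer, so far fewer files
-- are produced than one per scanning thread.
CREATE TABLE dfs.tmp.`report_sorted` AS
SELECT t.event_date, t.user_id, t.amount
FROM dfs.`/data/events` t
ORDER BY t.event_date;
```

The trade-off is that the final write is no longer parallel, so this probably suits small report outputs better than large ones.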

I will do some reading and experimentation with
store.partition.hash_distribute.
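
A rough sketch of what I plan to try first, untested on my side. My understanding is that this option applies to CTAS statements that use PARTITION BY, which is an assumption I still need to verify, and all table, path, and column names are hypothetical:

```sql
-- Alpha option mentioned by Andries; disabled by default.
ALTER SESSION SET `store.partition.hash_distribute` = true;

-- Hypothetical names: dfs.tmp, report_by_date, /data/events, columns.
-- Assumption: rows are hash-distributed on the partition key before the
-- write, so each partition value is handled by one writer instead of
-- every fragment writing its own file for it.
CREATE TABLE dfs.tmp.`report_by_date`
PARTITION BY (event_date) AS
SELECT t.event_date, t.user_id, t.amount
FROM dfs.`/data/events` t;
```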

Francois




On Mon, Oct 31, 2016 at 4:24 PM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> You can try and set store.partition.hash_distribute to true, but it is
> still listed as an alpha feature.
>
> You can also add a sort operation (order by) to the CTAS statement to
> force a single data stream at output. I believe this was discussed a while
> back on the user list.
>
> Ideally you want to look at the data set size and how much parallelism
> would work best in your environment for reading the data later.
>
> --Andries
>
>
> > On Oct 31, 2016, at 12:57 PM, François Méthot <fmethot78@gmail.com> wrote:
> >
> > Hi,
> >
> > Is there a way to limit the number of files produced by a CTAS query?
> > I would like the speed benefits of having hundreds of scanner fragments
> > but don't want to deal with hundreds of output files.
> >
> > Our use case right now uses 880 threads to scan and produce a report
> > output spread over... 880 Parquet files.
> > Each resulting file is ~7M.
> >
> > The only way I found to reduce those files to a smaller set is to perform
> > a second CTAS query on the aggregated files with
> > planner.width.max_per_query set to a smaller number.
> >
> > Any possible way to do this in one query?
> >
> > Thanks
> > Francois
>
>
