incubator-drill-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Parquet file partition size
Date Mon, 08 Sep 2014 19:14:55 GMT
Cool.



On Mon, Sep 8, 2014 at 11:01 AM, Jacques Nadeau <jacques@apache.org> wrote:

> For all three of these variables, you can use the ALTER SESSION or ALTER
> SYSTEM statements.  See more here:
>
> https://cwiki.apache.org/confluence/display/DRILL/SQL+Commands+Summary
>
> https://cwiki.apache.org/confluence/display/DRILL/Planning+and+Execution+Options
>
> example usage:
>
> ALTER SESSION SET `planner.slice_target` = 100000;
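>
> [Editor's note: a sketch of the system-wide variant, in case the change
> should outlive the session; the value shown is illustrative, not a
> recommendation:]
>
>   ALTER SYSTEM SET `planner.slice_target` = 100000;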
>
>
>
> On Mon, Sep 8, 2014 at 10:50 AM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > Where are these variables best modified?
> >
> >
> >
> >
> > On Mon, Sep 8, 2014 at 8:40 AM, Jacques Nadeau <jacques@apache.org>
> wrote:
> >
> > > Drill's default behavior is to use estimates to determine the number of
> > > files that will be written.  The equation is fairly complicated.
> > However,
> > > there are three key variables that will impact file splits.  These are:
> > >
> > > planner.slice_target: targeted number of records to allow within a
> > > single slice before increasing parallelization (defaults to 1 million
> > > in 0.4, 100k in 0.5)
> > > planner.width.max_per_node: maximum number of slices run per node
> > > (defaults to 0.7 * core count)
> > > store.parquet.block-size: largest allowed row group when generating
> > > Parquet files (defaults to 512 MB)
> > >
> > > If you are getting more files than you would like, you can decrease
> > > planner.width.max_per_node.
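> > >
> > > [Editor's note: a session-level sketch of that change; the value 4
> > > here is illustrative, not a recommendation:]
> > >
> > >   ALTER SESSION SET `planner.width.max_per_node` = 4;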
> > >
> > > It's likely that Jim Scott's experience with a smaller number of
> > > files was due to running on a machine with a smaller number of cores
> > > or the optimizer estimating a smaller amount of data in the output.
> > > The behavior is data and machine dependent.
> > >
> > > thanks,
> > > Jacques
> > >
> > >
> > > On Mon, Sep 8, 2014 at 8:32 AM, Jim Scott <jscott@maprtech.com> wrote:
> > >
> > > > I have created tables with Drill in parquet format and it created
> > > > 2 files.
> > > >
> > > >
> > > > On Fri, Sep 5, 2014 at 3:46 PM, Jim <jimfcarroll@gmail.com> wrote:
> > > >
> > > > >
> > > > > Actually, it looks like it always breaks it into 6 pieces by
> > > > > default. Is there a way to make the partition size fixed rather
> > > > > than the number of partitions?
> > > > >
> > > > >
> > > > > On 09/05/2014 04:40 PM, Jim wrote:
> > > > >
> > > > >> Hello all,
> > > > >>
> > > > >> I've been experimenting with Drill to load data into Parquet
> > > > >> files. I noticed rather large variability in the size of each
> > > > >> Parquet chunk. Is there a way to control this?
> > > > >>
> > > > >> The documentation seems a little sparse on configuring some of
> > > > >> the finer details. My apologies if I missed something obvious.
> > > > >>
> > > > >> Thanks
> > > > >> Jim
> > > > >>
> > > > >>
> > > > >
> > > >
> > > >
> > > > --
> > > > *Jim Scott*
> > > > Director, Enterprise Strategy & Architecture
> > > >
> > > > MapR Technologies <http://www.mapr.com>
> > >
> >
>
