From: Jacques Nadeau
To: "drill-user@incubator.apache.org"
Subject: Re: Parquet file partition size
Date: Mon, 8 Sep 2014 08:40:30 -0700
References: <540A1FDB.4070605@gmail.com> <540A2129.3030507@gmail.com>

Drill's default behavior is to use estimates to determine the number of
files that will be written. The equation is fairly complicated. However,
there are three key variables that will impact file splits. These are:

planner.slice_target: targeted number of records to allow within a single
slice before increasing parallelization (defaults to 1 million in 0.4,
100k in 0.5)

planner.width.max_per_node: maximum number of slices run per node
(defaults to 0.7 * core count)

store.parquet.block-size: largest allowed row group when generating
Parquet files (defaults to 512 MB)

If you are getting more files than you would like, you can decrease
planner.width.max_per_node to a smaller number.

It's likely that Jim Scott's experience with a smaller number of files was
due to running on a machine with a smaller number of cores or the optimizer
estimating a smaller amount of data in the output. The behavior is data and
machine dependent.

thanks,
Jacques

On Mon, Sep 8, 2014 at 8:32 AM, Jim Scott wrote:

> I have created tables with Drill in parquet format and it created 2 files.
>
> On Fri, Sep 5, 2014 at 3:46 PM, Jim wrote:
>
> > Actually, it looks like it always breaks it into 6 pieces by default. Is
> > there a way to make the partition size fixed rather than the number of
> > partitions?
> >
> > On 09/05/2014 04:40 PM, Jim wrote:
> >
> >> Hello all,
> >>
> >> I've been experimenting with drill to load data into Parquet files. I
> >> noticed rather large variability in the size of each parquet chunk. Is
> >> there a way to control this?
> >>
> >> The documentation seems a little sparse on configuring some of the finer
> >> details. My apologies if I missed something obvious.
> >>
> >> Thanks
> >> Jim
>
> --
> *Jim Scott*
> Director, Enterprise Strategy & Architecture
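
For anyone wanting to try the three options named in the reply, here is a minimal sketch of setting them for a single session. It assumes Drill's ALTER SESSION SET syntax with backtick-quoted option names and that store.parquet.block-size is given in bytes. The values are hypothetical and chosen only for illustration; they are not recommendations from the thread.

```sql
-- Hypothetical values for illustration only; tune them for your own data and cluster.

-- Cap the number of slices run per node, which should reduce the number of files written.
ALTER SESSION SET `planner.width.max_per_node` = 2;

-- Target number of records allowed in a single slice before parallelization increases.
ALTER SESSION SET `planner.slice_target` = 1000000;

-- Largest allowed Parquet row group, assumed to be in bytes (256 MB here).
ALTER SESSION SET `store.parquet.block-size` = 268435456;
```

Since the reply describes the result as data and machine dependent, it is probably worth changing one option at a time and checking the file count after each table is written.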