Subject: Re: Parquet file size
From: Cheng Lian
To: Younes Naguib, user@spark.apache.org
Date: Wed, 7 Oct 2015 12:18:29 -0700

Why do you want larger files? Doesn't the resulting Parquet file contain all the data in the original TSV file?

Cheng

On 10/7/15 11:07 AM, Younes Naguib wrote:

Hi,

I'm reading a large TSV file and creating Parquet files using Spark SQL:

insert overwrite table tbl partition(year, month, day)....
Select .... from tbl_tsv;

This works nicely, but generates small parquet files (15MB).

I want to generate larger files. Any idea how to address this?

Thanks,

Younes Naguib

Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC H3G 1R8

Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 | younes.naguib@tritondigital.com

