spark-user mailing list archives

From Younes Naguib <Younes.Nag...@tritondigital.com>
Subject RE: Parquet file size
Date Wed, 07 Oct 2015 20:55:38 GMT
The original TSV data is 600GB, and the job generated 40k files of 15-25MB each.

y

From: Cheng Lian [mailto:lian.cs.zju@gmail.com]
Sent: October-07-15 3:18 PM
To: Younes Naguib; 'user@spark.apache.org'
Subject: Re: Parquet file size

Why do you want larger files? Doesn't the resulting Parquet file contain all the data in the
original TSV file?

Cheng
On 10/7/15 11:07 AM, Younes Naguib wrote:
Hi,

I'm reading a large TSV file and creating Parquet files using Spark SQL:
insert overwrite table tbl partition(year, month, day) ....
select .... from tbl_tsv;

This works nicely, but it generates small Parquet files (about 15MB each).
I want to generate larger files. Any idea how to address this?

Thanks,
Younes Naguib
Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC  H3G 1R8
Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 | younes.naguib@tritondigital.com
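
Editorial note: the thread does not settle on a fix. In Spark, the number of output files generally follows the number of tasks that write data, so one common way to get larger files is to write with fewer partitions (for example via the DataFrame `repartition` or `coalesce` methods). The sketch below only estimates how many files that would mean for the figures quoted in the thread. The 20MB average and the 256MB target are assumptions chosen for illustration, not values from the thread, and `files_for_target_size` is a hypothetical helper, not a Spark API.

```python
def files_for_target_size(total_output_bytes: int, target_file_bytes: int) -> int:
    """Return how many files are needed so each is at most target_file_bytes."""
    if total_output_bytes <= 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_output_bytes // target_file_bytes)


MB = 1024 ** 2

# From the thread: roughly 40k files of 15-25MB each.
observed_files = 40_000
# Assumption: take 20MB as a midpoint of the reported 15-25MB range.
assumed_avg_file_bytes = 20 * MB
total_output_bytes = observed_files * assumed_avg_file_bytes

# Assumption: a hypothetical target file size; the thread does not name one.
target_file_bytes = 256 * MB

target_files = files_for_target_size(total_output_bytes, target_file_bytes)
print(target_files)
```

With a dynamically partitioned insert such as the one above, each writing task may produce one file per (year, month, day) directory it holds rows for, so the actual file count can be higher than this estimate.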


