spark-user mailing list archives

From Chin Wei Low <lowchin...@gmail.com>
Subject Re: Spark app write too many small parquet files
Date Mon, 28 Nov 2016 14:01:02 GMT
Try limiting the partitions: spark.sql.shuffle.partitions.

This controls the number of files generated.
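
A minimal sketch of what I mean (df, the partition count of 8, and the output
path are placeholders, not from your app):

// Lower the shuffle partition count before the write so fewer files come out.
spark.conf.set("spark.sql.shuffle.partitions", "8")

// Or explicitly reduce the number of output partitions just before writing.
df.coalesce(8)
  .write
  .mode("overwrite")
  .parquet("/path/to/output")

Each output partition becomes one Parquet file, so fewer partitions means
fewer, larger files.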

On 28 Nov 2016 8:29 p.m., "Kevin Tran" <kevintvh@gmail.com> wrote:

> Hi Denny,
> Thank you for your inputs. I also use 128 MB, but the Spark app still
> generates too many files, only ~14 KB each! That's why I'm asking whether
> there is a solution, in case someone else has had the same issue.
>
> Cheers,
> Kevin.
>
> On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee <denny.g.lee@gmail.com> wrote:
>
>> Generally, yes - you should try to have larger data sizes due to the
>> overhead of opening up files.  Typical guidance is between 64 MB-1 GB;
>> personally I usually stick with 128 MB-512 MB with the default snappy
>> codec compression for Parquet.  A good reference is Vida Ha's presentation Data
>> Storage Tips for Optimal Spark Performance
>> <https://spark-summit.org/2015/events/data-storage-tips-for-optimal-spark-performance/>.
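>>
>> A rough sketch of sizing the output (df, rowCount, and approxBytesPerRow
>> are placeholder assumptions, just an illustration rather than anything
>> from the presentation):
>>
>> // Aim for roughly 128 MB per output file by estimating a partition count,
>> // then write Parquet with the snappy codec spelled out explicitly.
>> val targetFileBytes = 128L * 1024 * 1024
>> val estimatedBytes = rowCount * approxBytesPerRow  // rough size estimate
>> val numFiles = math.max(1L, estimatedBytes / targetFileBytes).toInt
>>
>> df.repartition(numFiles)
>>   .write
>>   .option("compression", "snappy")
>>   .parquet("/path/to/output")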
>>
>>
>> On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran <kevintvh@gmail.com> wrote:
>>
>>> Hi Everyone,
>>> Does anyone know the best practice for writing Parquet files from
>>> Spark?
>>>
>>> When the Spark app writes data to Parquet, the output directory ends up
>>> with heaps of very small Parquet files (such as
>>> e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each Parquet file is
>>> only 15 KB.
>>>
>>> Should it instead write bigger chunks of data (such as 128 MB) across a
>>> proper number of files?
>>>
>>> Has anyone found any performance changes when changing the data size of
>>> each Parquet file?
>>>
>>> Thanks,
>>> Kevin.
>>>
>>
>
