spark-user mailing list archives

From Lian Jiang <jiangok2...@gmail.com>
Subject read json and write into parquet in executors
Date Tue, 12 Mar 2019 02:52:21 GMT
Hi,

In my Spark batch job, the current workflow is:

step 1: the driver assigns a partition of the JSON file path list to each
executor.
step 2: each executor fetches its assigned JSON files from S3 and saves
them into HDFS.
step 3: the driver reads these JSON files into a DataFrame and saves it as
Parquet.

To improve performance by avoiding the intermediate write of the JSON
files to HDFS, I want to change the workflow to:

step 1: the driver assigns a partition of the JSON file path list to each
executor.
step 2: each executor fetches its assigned JSON files from S3, merges the
JSON content in memory, and writes Parquet directly. No JSON files are
written to HDFS.

However, I cannot create DataFrames inside executors. Is this improvement
feasible? Appreciate any help!
