spark-issues mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Resolved] (SPARK-9072) Parquet : Writing data to S3 very slowly
Date Wed, 15 Jul 2015 18:40:04 GMT


Sean Owen resolved SPARK-9072.
       Resolution: Invalid
    Fix Version/s:     (was: 1.5.0)

[~mkanchwala] Please read
 There are some problems here: don't set the priority to Critical, and you shouldn't have
set a fix version. This also looks like a question about something you're still investigating.
It's not suitable as a JIRA since you don't have a clear issue to report.

> Parquet : Writing data to S3 very slowly
> ----------------------------------------
>                 Key: SPARK-9072
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Murtaza Kanchwala
>            Priority: Critical
>              Labels: parquet
> I've created Spark programs that convert plain text files to Parquet and CSV and write
> the output to S3.
> There is around 8 TB of data, and I need to compress it to a smaller size for further
> processing on Amazon EMR.
> Results : 
> 1) Text -> CSV took 1.2 hrs to transform 8 TB of data and wrote it to S3 successfully,
> without any problems.
> 2) Text -> Parquet: the job completed in the same time (i.e. 1.2 hrs), but after job
> completion it is still writing the data separately to S3, which slows it down and leaves
> it in starvation.
> Input : s3n://<SameBucket>/input
> Output : s3n://<SameBucket>/output/parquet
> Let's say I have around 10K files; it then takes about 20 min per 1000 files to write
> them back to S3.
> Note : 
> Also, I found that the program creates a temp folder in the S3 output location, and in
> the logs I've seen S3ReadDelays.
> Can anyone tell me what I am doing wrong? Or is there anything I need to add so that the
> Spark app does not create a temp folder on S3 and writes quickly from EMR to S3, just
> like saveAsTextFile? Thanks

