spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-9072) Parquet : Writing data to S3 very slowly
Date Wed, 15 Jul 2015 18:40:04 GMT

     [ https://issues.apache.org/jira/browse/SPARK-9072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-9072.
------------------------------
       Resolution: Invalid
    Fix Version/s:     (was: 1.5.0)

[~mkanchwala] Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
There are some problems here: don't set the priority to Critical, and you shouldn't have set a
fix version. This also looks like a question about something you're still investigating, so it's
not suitable as a JIRA, since you don't have a clear issue to report.

> Parquet : Writing data to S3 very slowly
> ----------------------------------------
>
>                 Key: SPARK-9072
>                 URL: https://issues.apache.org/jira/browse/SPARK-9072
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Murtaza Kanchwala
>            Priority: Critical
>              Labels: parquet
>
> I've written Spark programs that convert plain text files to Parquet and CSV and write the
> output to S3.
> There is around 8 TB of data, and I need to compress it to a smaller size for further
> processing on Amazon EMR.
> Results:
> 1) Text -> CSV took 1.2 hrs to transform 8 TB of data and wrote it to S3 successfully,
> without any problems.
> 2) Text -> Parquet: the job completed in the same time (i.e. 1.2 hrs), but after job
> completion it is still writing the data separately to S3, which makes it slower and
> leaves it in starvation.
> Input : s3n://<SameBucket>/input
> Output : s3n://<SameBucket>/output/parquet
> For example, with around 10K files, it takes about 20 min per 1000 files to write back
> to S3.
> Note:
> I also found that the program creates a temp folder in the S3 output location, and I've
> seen S3ReadDelays in the logs.
> Can anyone tell me what I am doing wrong? Or is there anything I need to add so that
> the Spark app doesn't create a temp folder on S3 and writes from EMR to S3 as fast as
> saveAsTextFile does? Thanks
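
For scale, the write rate quoted in the report can be turned into a rough total. This is only a back-of-the-envelope sketch: it assumes the reported rate (1000 files per 20 min) stays constant across all 10K files, and the function name is hypothetical, not part of Spark or EMR.

```python
def estimated_write_minutes(total_files, files_per_batch, minutes_per_batch):
    """Estimate total write time, assuming a constant files-per-batch rate."""
    return total_files / files_per_batch * minutes_per_batch


# Figures taken from the report: ~10K files, 1000 files per 20 min.
minutes = estimated_write_minutes(10_000, 1_000, 20)
print(f"{minutes:.0f} min (~{minutes / 60:.1f} h)")
```

Under that assumption, the post-job write phase alone would take well over the 1.2 hrs that the job itself needed.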



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

