spark-user mailing list archives

From Nicholas Chammas <>
Subject Re: increasing concurrency of saveAsNewAPIHadoopFile?
Date Thu, 19 Jun 2014 21:26:39 GMT
The main things that affect the concurrency of any saveAs...()
operation are (a) the number of partitions in your RDD, and (b) how many
cores your cluster has.
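
That advice can be sketched roughly as below. This is only an illustration, not code from the thread: `records`, `MyOutputFormat`, the key/value types, and the output path are hypothetical stand-ins for Sandeep's actual RDD and custom OutputFormat.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SaveWithMoreConcurrency {

    // Hypothetical: stands in for the custom OutputFormat from the question.
    // In real code this would extend
    // org.apache.hadoop.mapreduce.OutputFormat<Text, IntWritable>.

    static void save(JavaSparkContext sc, JavaPairRDD<Text, IntWritable> records) {
        // One save task runs per partition, so an RDD with fewer partitions
        // than the cluster has cores leaves cores idle during the save.
        int targetPartitions = sc.defaultParallelism();
        JavaPairRDD<Text, IntWritable> spread =
            records.partitions().size() < targetPartitions
                ? records.repartition(targetPartitions)  // shuffle to add partitions
                : records;                               // already enough parallelism

        spread.saveAsNewAPIHadoopFile(
            "hdfs:///tmp/output",     // hypothetical output path
            Text.class,               // key class
            IntWritable.class,        // value class
            MyOutputFormat.class,     // the custom OutputFormat (hypothetical name)
            new Configuration());
    }
}
```

No threads or RDD splitting are needed: Spark already writes all partitions in parallel, so increasing the partition count is the idiomatic way to raise the concurrency of the save.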

How big is the RDD in question? How many partitions does it have?

On Thu, Jun 19, 2014 at 3:38 PM, Sandeep Parikh <>

> I'm trying to write a JavaPairRDD to a downstream database using
> saveAsNewAPIHadoopFile with a custom OutputFormat and the process is pretty
> slow.
> Is there a way to boost the concurrency of the save process? For example,
> something like splitting the RDD into multiple smaller RDDs and using Java
> threads to write the data out? That seems foreign to the way Spark works, so
> I'm not sure if there's a better way.
