spark-dev mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Dataframe Partitioning
Date Tue, 01 Mar 2016 23:43:06 GMT
If you have to pick a number, it's better to overestimate than to
underestimate, since launching a task in Spark is relatively cheap compared
to spilling to disk or OOMing (now much less likely thanks to Tungsten).
Eventually we plan to make this dynamic, but for now you should tune it for
your particular workload.
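
For concreteness, a minimal sketch of overriding the default, assuming a
spark-shell style SQLContext named `sqlContext` (the input path and column
name are placeholders):

    // Raise the post-shuffle partition count for this session (default is 200).
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")

    // Hypothetical input; any wide operation (groupBy, join, ...) now
    // shuffles into 400 partitions rather than 200.
    val df = sqlContext.read.parquet("/path/to/input")
    val counts = df.groupBy("key").count()
    println(counts.rdd.getNumPartitions)  // 400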

On Tue, Mar 1, 2016 at 3:19 PM, Teng Liao <tliao@palantir.com> wrote:

> Hi,
>
> I was wondering what the rationale is behind defaulting all repartitioning
> to spark.sql.shuffle.partitions. I’m seeing a huge overhead when running a
> job whose input has only 2 partitions but which, with the default value of
> spark.sql.shuffle.partitions, is shuffled into 200 partitions. Thanks.
>
> -Teng Fei Liao
>
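
A minimal sketch of the behavior described above (paths and column names are
placeholders): the post-shuffle partition count comes from the config, not
from the input, so a 2-partition input fans out to 200 partitions after any
wide transformation unless the setting is lowered or the result is
repartitioned explicitly:

    val small = sqlContext.read.parquet("/data/small")  // say, 2 input partitions
    println(small.rdd.getNumPartitions)                 // 2

    val agg = small.groupBy("key").count()              // wide transformation: shuffles
    println(agg.rdd.getNumPartitions)                   // 200 (the default), not 2

    // Until the count is chosen dynamically: lower the setting for small jobs,
    sqlContext.setConf("spark.sql.shuffle.partitions", "2")
    // or collapse the already-shuffled result without a further shuffle:
    val compact = agg.coalesce(2)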
