beam-commits mailing list archives

From "Tim Robertson (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (BEAM-3022) Enable the ability to grow partition count in the underlying Spark RDD
Date Thu, 05 Oct 2017 18:39:00 GMT

     [ https://issues.apache.org/jira/browse/BEAM-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Robertson updated BEAM-3022:
--------------------------------
    Description: 
When using a {{HadoopInputFormatIO}}, the number of splits appears to be controlled by the underlying {{InputFormat}}, which in turn determines the number of partitions, and therefore the parallelisation, when running on Spark. It is possible to {{Reshuffle}} the data to compensate for data skew, but there _appears_ to be no way to grow the number of partitions. {{GroupCombineFunctions.reshuffle}} seems to be the only place that calls Spark's {{repartition}}, and it uses the number of partitions from the original RDD.
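
For illustration, here is a minimal sketch of the behaviour described above in terms of Spark's Java RDD API. It is not the actual runner code, and the method names are hypothetical:

{code:java}
import org.apache.spark.api.java.JavaRDD;

public class ReshuffleSketch {

  // Roughly what the runner's reshuffle amounts to today: the target
  // partition count is taken from the input RDD itself, so data skew is
  // evened out but the partition count (and hence parallelism) cannot grow.
  static <T> JavaRDD<T> reshuffleKeepingCount(JavaRDD<T> rdd) {
    return rdd.repartition(rdd.getNumPartitions());
  }

  // A hypothetical variant that would let the caller grow the partition
  // count, e.g. to fan a single-partition zip read out across the cluster.
  static <T> JavaRDD<T> reshuffleToCount(JavaRDD<T> rdd, int targetPartitions) {
    return rdd.repartition(targetPartitions);
  }
}
{code}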

Scenarios that would benefit from this:
# Increasing parallelisation for computationally heavy stages
# ETLs where the input partitions are dictated by the source, while you wish to optimise the partitions for fast loading into the target sink
# Zip files (my case), which are read in a single-threaded manner with a custom {{HadoopInputFormat}} and therefore get a single task for all stages (see the sketch below)

(It would be nice if a user could supply a partitioner too, to help dictate data locality)
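
To make the zip-file scenario concrete, the sketch below shows the shape of such a pipeline. {{MySingleSplitZipInputFormat}} is a hypothetical stand-in for the custom {{InputFormat}}, not a real class, and configuration details are elided:

{code:java}
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;

public class SingleSplitZipRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The custom InputFormat (hypothetical class) produces a single split,
    // so the resulting RDD has a single partition.
    Configuration conf = new Configuration(false);
    conf.setClass("mapreduce.job.inputformat.class",
        MySingleSplitZipInputFormat.class, InputFormat.class);
    conf.setClass("key.class", Text.class, Object.class);
    conf.setClass("value.class", Text.class, Object.class);

    PCollection<KV<Text, Text>> records =
        p.apply(HadoopInputFormatIO.<Text, Text>read().withConfiguration(conf));

    // Reshuffle redistributes elements across the existing partitions, but
    // on the Spark runner the partition count itself is unchanged, so all
    // downstream stages still run as a single task.
    records.apply(Reshuffle.<Text, Text>of());

    p.run().waitUntilFinish();
  }
}
{code}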

  was:
When using a {{HadoopInputFormatIO}}, the number of splits appears to be controlled by the underlying {{InputFormat}}, which in turn determines the number of partitions, and therefore the parallelisation, when running on Spark. It is possible to {{Reshuffle}} the data to compensate for data skew, but there _appears_ to be no way to grow the number of partitions. {{GroupCombineFunctions.reshuffle}} seems to be the only place that calls Spark's {{repartition}}, and it uses the number of partitions from the original RDD.

Scenarios that would benefit from this:
# Increasing parallelisation for computationally heavy stages
# ETLs where the input partitions are dictated by the source, while you wish to optimise the partitions for fast loading into the target sink
# Zip files (my case), which are read in a single-threaded manner with a custom {{HadoopInputFormat}} and therefore get a single executor

(It would be nice if a user could supply a partitioner too, to help dictate data locality)


> Enable the ability to grow partition count in the underlying Spark RDD
> ----------------------------------------------------------------------
>
>                 Key: BEAM-3022
>                 URL: https://issues.apache.org/jira/browse/BEAM-3022
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-spark
>            Reporter: Tim Robertson
>            Assignee: Amit Sela



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
