beam-commits mailing list archives

From "Tim Robertson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-3192) Be able to specify the Spark Partitioner via the pipeline options
Date Wed, 15 Nov 2017 10:06:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16253214#comment-16253214 ]

Tim Robertson commented on BEAM-3192:
-------------------------------------

A use case for this is iterative algorithms that repeatedly merge RDDs.  In some
cases you can get significant performance improvements by colocating the RDDs that
will be merged (partitioning them with the same partitioner), since the merge can
then avoid a full shuffle.
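
To illustrate why colocation helps, here is a minimal sketch in plain Python (no Spark involved). It assumes a simple hash partitioner built on {{crc32}}; the function names, the partition count and the tile keys are made up for this example and are not taken from the GBIF code or from Beam.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 4  # assumed partition count, chosen only for this illustration


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Deterministic stand-in for a hash partitioner: key -> partition index."""
    return crc32(str(key).encode("utf-8")) % num_partitions


def partition_by(records, num_partitions=NUM_PARTITIONS):
    """Split (key, value) records into per-partition dicts of key -> [values]."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)][key].append(value)
    return partitions


def merge_copartitioned(left, right):
    """Merge two datasets that share a partitioner, one partition at a time.

    Partition i of `left` only ever needs partition i of `right`, so no
    record has to move between partitions (the analogue of avoiding a shuffle).
    """
    merged = []
    for left_part, right_part in zip(left, right):
        out = {}
        for key in set(left_part) | set(right_part):
            out[key] = sum(left_part.get(key, [])) + sum(right_part.get(key, []))
        merged.append(out)
    return merged


# Hypothetical per-tile counts from two inputs that need to be merged.
tiles_a = partition_by([("tile-1", 3), ("tile-2", 5), ("tile-3", 1)])
tiles_b = partition_by([("tile-1", 4), ("tile-3", 2), ("tile-4", 7)])

merged = merge_copartitioned(tiles_a, tiles_b)
totals = {key: count for part in merged for key, count in part.items()}
```

Because both inputs were split with the same {{partition_for}}, every key lands in the same partition index on both sides, which is what lets the merge stay local to each partition.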

One example is the maps on GBIF.org (e.g. [Animals|https://www.gbif.org/species/1],
[Birds|https://www.gbif.org/species/212], [Sparrows|https://www.gbif.org/species/2492321])
which are recalculated every few hours in Spark jobs coordinated by Oozie, and persisted in
HBase.  This relies on using Spark partitioning to [merge zoom levels up to world views|https://github.com/gbif/maps/blob/master/spark-process/src/main/scala/org/gbif/maps/spark/BackfillTiles.scala#L142]
efficiently.  

Another use case might be building HFiles offline in Spark for [efficient loading into HBase|http://www.opencore.com/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/]
which requires a {{repartitionAndSortWithinPartitions}} operation.
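
As a rough picture of what that operation produces, the plain-Python sketch below (again no Spark) routes each record to a partition and leaves every partition sorted by key. The range partitioner over split keys is an assumption made for this example, loosely modelled on the idea of one partition per HBase region; the split keys and row keys are invented. Spark sorts as part of the shuffle, whereas this sketch sorts afterwards for simplicity.

```python
from bisect import bisect_right


def make_range_partitioner(split_keys):
    """Build a partitioner from sorted split keys (hypothetical region boundaries).

    Keys below the first split go to partition 0; a key equal to a split key
    goes to the partition that starts at that split.
    """
    def partitioner(key):
        return bisect_right(split_keys, key)
    return partitioner


def repartition_and_sort(records, split_keys):
    """Group (key, value) records by partition, then sort each partition by key."""
    partitioner = make_range_partitioner(split_keys)
    partitions = [[] for _ in range(len(split_keys) + 1)]
    for key, value in records:
        partitions[partitioner(key)].append((key, value))
    for part in partitions:
        part.sort(key=lambda record: record[0])
    return partitions


# Invented row keys and values, in arbitrary input order.
rows = [("zebra", 1), ("apple", 2), ("kiwi", 3), ("banana", 4),
        ("grape", 5), ("pear", 6), ("mango", 7)]

partitions = repartition_and_sort(rows, split_keys=["g", "p"])
keys_per_partition = [[key for key, _ in part] for part in partitions]
```

Each partition then holds a contiguous, sorted key range, which is the shape needed when writing one sorted file per partition.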

> Be able to specify the Spark Partitioner via the pipeline options
> -----------------------------------------------------------------
>
>                 Key: BEAM-3192
>                 URL: https://issues.apache.org/jira/browse/BEAM-3192
>             Project: Beam
>          Issue Type: New Feature
>          Components: runner-spark
>            Reporter: Jean-Baptiste Onofré
>            Assignee: Jean-Baptiste Onofré
>
> As we did for the StorageLevel, it would be great for a user to be able to provide the
Spark partitioner via PipelineOptions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
