beam-commits mailing list archives

From Ismaël Mejía (JIRA) <j...@apache.org>
Subject [jira] [Commented] (BEAM-673) Data locality for Read.Bounded
Date Wed, 26 Apr 2017 20:53:04 GMT

    [ https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985529#comment-15985529 ]

Ismaël Mejía commented on BEAM-673:
-----------------------------------

Oh, you have a point [~lcwik], I hadn't thought about this as a general problem. Note
that the case you mention is a common task for resource managers, e.g. YARN, Mesos or Kubernetes.
They have the concept of resources/offers, and the underlying processing system, e.g.
Hadoop, Spark or Flink, just tells them its preferences for allocating the workers.

[~jkff] I agree with the data-dependency aspect that you mention. In the case of this
JIRA, the sources are the ones that know the information that the runner needs to pass to
the given resource manager, but in other cases it could be a specific
transform, e.g. an ML-specific transform could hint that it needs a GPU, as Luke mentioned.

This definitely deserves extra research and a more formal design to cover the more general
scenario, so I am moving it out of the FSR list. I will also create a new JIRA for the
more general case and leave this one for the particular case of data locality for the Spark runner.


> Data locality for Read.Bounded
> ------------------------------
>
>                 Key: BEAM-673
>                 URL: https://issues.apache.org/jira/browse/BEAM-673
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Amit Sela
>            Assignee: Ismaël Mejía
>
> In some distributed filesystems, such as HDFS, we should be able to hint to Spark the
> preferred locations of splits.
> Here is an example of how Spark does that for Hadoop RDDs:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L249
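
As a rough illustration of the idea in the issue description, here is a toy sketch of locality hints. It is not Spark's or Beam's actual API: the names Split, preferred_hosts and assign_splits are hypothetical, and the greedy policy is an assumption made for the sketch. Each split reports the hosts that hold its data, and the scheduler treats that list as a preference rather than a hard constraint.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Hypothetical stand-in for one split of a bounded source."""
    name: str
    preferred_hosts: tuple  # hosts holding the split's data, e.g. HDFS block locations


def assign_splits(splits, workers):
    """Greedy sketch: place each split on its least-loaded preferred host.

    Falls back to the least-loaded worker overall when none of the
    preferred hosts runs a worker, since locality is only a hint.
    """
    load = defaultdict(int)  # number of splits already placed on each host
    assignment = {}
    for split in splits:
        local = [h for h in split.preferred_hosts if h in workers]
        candidates = local or list(workers)
        host = min(candidates, key=lambda h: load[h])
        load[host] += 1
        assignment[split.name] = host
    return assignment


splits = [
    Split("part-0", ("host-a", "host-b")),
    Split("part-1", ("host-a",)),
    Split("part-2", ("host-z",)),  # no worker runs on host-z
]
workers = ["host-a", "host-b", "host-c"]
assignment = assign_splits(splits, workers)
```

In this sketch part-0 and part-1 land on hosts that hold their data, while part-2 falls back to a non-local worker because no worker runs on its preferred host.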



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
