beam-commits mailing list archives

From Ismaël Mejía (JIRA) <j...@apache.org>
Subject [jira] [Commented] (BEAM-673) Data locality for Read.Bounded
Date Tue, 25 Apr 2017 19:54:04 GMT

    [ https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983513#comment-15983513 ]

Ismaël Mejía commented on BEAM-673:
-----------------------------------

[~jkff] Do you have a concrete idea of the best way to do this inside the DoFn? I ask
because that is the whole goal of this JIRA, and data locality is highly valued in the
Hadoop/Spark world. For Beam, I thought we could achieve this by hacking Read.Bounded a bit
and adding an extra method to the Source API so that runners take the locality hints from
the sources, but since you have worked with IO far more than I have, maybe you can give me
(us) some ideas.
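
To make the "extra method on the Source API" idea concrete, here is a rough, self-contained
sketch. Everything in it is hypothetical: LocalityAwareSource, preferredLocations(),
FileSplitSource and chooseWorker() are illustrative names only and are not part of Beam's
actual Source API or of any runner. It only shows the shape of the contract: a split
exposes optional host hints, and the runner uses them when it can and falls back otherwise.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LocalityHintSketch {

    /** Hypothetical stand-in for a bounded source split; NOT Beam's BoundedSource. */
    interface LocalityAwareSource {
        String name();

        /**
         * Hypothetical extra method: hosts that hold this split's data.
         * The default is an empty list, meaning "no locality information".
         */
        default List<String> preferredLocations() {
            return Collections.emptyList();
        }
    }

    /** Hypothetical split of a file whose blocks live on known hosts (as in HDFS). */
    static final class FileSplitSource implements LocalityAwareSource {
        private final String name;
        private final List<String> hosts;

        FileSplitSource(String name, String... hosts) {
            this.name = name;
            this.hosts = Arrays.asList(hosts);
        }

        @Override
        public String name() {
            return name;
        }

        @Override
        public List<String> preferredLocations() {
            return hosts;
        }
    }

    /**
     * Hypothetical runner-side logic: pick the first hinted host that is an
     * available worker, otherwise fall back to the worker at fallbackIndex.
     * The hint is advisory; a split without hints is still scheduled.
     */
    static String chooseWorker(LocalityAwareSource split, List<String> workers, int fallbackIndex) {
        for (String host : split.preferredLocations()) {
            if (workers.contains(host)) {
                return host;
            }
        }
        return workers.get(Math.floorMod(fallbackIndex, workers.size()));
    }

    public static void main(String[] args) {
        List<String> workers = Arrays.asList("node-a", "node-b", "node-c");
        LocalityAwareSource hinted = new FileSplitSource("part-0", "node-x", "node-b");
        LocalityAwareSource unhinted = new FileSplitSource("part-1");

        System.out.println(hinted.name() + " -> " + chooseWorker(hinted, workers, 0));
        System.out.println(unhinted.name() + " -> " + chooseWorker(unhinted, workers, 2));
    }
}
```

Making the hint a default method that returns an empty list would keep existing sources
unchanged, and a runner with no notion of locality could simply ignore it.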

> Data locality for Read.Bounded
> ------------------------------
>
>                 Key: BEAM-673
>                 URL: https://issues.apache.org/jira/browse/BEAM-673
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Amit Sela
>            Assignee: Ismaël Mejía
>             Fix For: First stable release
>
>
> In some distributed filesystems, such as HDFS, we should be able to hint to Spark the
> preferred locations of splits.
> Here is an example of how Spark does that for Hadoop RDDs:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L249



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
