flink-issues mailing list archives

From "Fabian Hueske (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (FLINK-1443) Add replicated data source
Date Mon, 26 Jan 2015 13:46:35 GMT

     [ https://issues.apache.org/jira/browse/FLINK-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabian Hueske reassigned FLINK-1443:

    Assignee: Fabian Hueske

> Add replicated data source
> --------------------------
>                 Key: FLINK-1443
>                 URL: https://issues.apache.org/jira/browse/FLINK-1443
>             Project: Flink
>          Issue Type: New Feature
>          Components: Java API, JobManager, Optimizer
>    Affects Versions: 0.9
>            Reporter: Fabian Hueske
>            Assignee: Fabian Hueske
>            Priority: Minor
> This issue proposes adding support for data sources that read the same data in all parallel
instances. This feature is useful if the data is replicated to all machines in a cluster
and can be read locally.
> For example, a replicated input format can be used for a broadcast join without sending
any data over the network.
> The following changes are necessary to achieve this:
> 1) Add a replicating InputSplitAssigner which assigns all splits to all parallel
instances. This also requires extending the InputSplitAssigner interface to identify the exact
parallel instance that requests an InputSplit (currently only the hostname is provided).
> 2) Make sure that the DOP of the replicated data source is identical to the DOP of its
> 3) Let the optimizer know that the data is replicated and ensure that plan enumeration
works correctly.
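
As a rough illustration of change 1), the following is a minimal, self-contained sketch of
what a replicating assigner could look like. It is not Flink's actual API: the class name
ReplicatingSplitAssigner, the use of plain strings as splits, and the extra taskIndex
parameter are all assumptions made for this example. The only idea taken from the issue is
that the assigner must know which parallel instance is asking, so that every instance
receives every split exactly once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; none of these names come from Flink itself.
class ReplicatingAssignerSketch {

    /** Hands out every split to every parallel instance, once per instance. */
    static class ReplicatingSplitAssigner {
        // Splits are modeled as plain strings to keep the sketch self-contained.
        private final List<String> splits;
        // Position of the next split to return, tracked per parallel instance.
        private final Map<Integer, Integer> nextIndex = new HashMap<>();

        ReplicatingSplitAssigner(List<String> splits) {
            this.splits = new ArrayList<>(splits);
        }

        /**
         * Returns the next split for the requesting instance, or null when
         * that instance has already received all splits.
         *
         * @param host      hostname of the requester (unused in this sketch)
         * @param taskIndex index of the requesting parallel instance; this is
         *                  the assumed extension to the assigner interface
         */
        synchronized String getNextInputSplit(String host, int taskIndex) {
            int i = nextIndex.getOrDefault(taskIndex, 0);
            if (i >= splits.size()) {
                return null; // this instance has consumed every split
            }
            nextIndex.put(taskIndex, i + 1);
            return splits.get(i);
        }
    }

    public static void main(String[] args) {
        ReplicatingSplitAssigner assigner =
                new ReplicatingSplitAssigner(Arrays.asList("split-0", "split-1"));
        // Two parallel instances on the same host each drain the full split list.
        for (int task = 0; task < 2; task++) {
            String split;
            while ((split = assigner.getNextInputSplit("host-a", task)) != null) {
                System.out.println("task " + task + " reads " + split);
            }
        }
    }
}
```

Keying the progress map by task index rather than hostname is what makes the sketch work
when several parallel instances run on the same machine, which is the case the hostname-only
interface cannot distinguish.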

This message was sent by Atlassian JIRA
