flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Robert Metzger (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-1287) Improve File Input Split assignment
Date Wed, 26 Nov 2014 16:03:12 GMT
Robert Metzger created FLINK-1287:
-------------------------------------

             Summary: Improve File Input Split assignment
                 Key: FLINK-1287
                 URL: https://issues.apache.org/jira/browse/FLINK-1287
             Project: Flink
          Issue Type: Improvement
          Components: Local Runtime
            Reporter: Robert Metzger


While running some DFS read-intensive benchmarks, I found that the assignment of input splits
is not optimal. In particular in cases where the numWorker != numDataNodes and when the replication
factor is low (in my case it was 1).

In the particular example, the input had 40960 splits, of which 4694 were read remotely. 
Spark did only 2056 remote reads for the same dataset.
With the replication factor increased to 2, Flink did only 290 remote reads. So usually, users
shouldn't be affected by this issue.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message