flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Fabian Hueske <fhue...@gmail.com>
Subject Re: Data Locality in Flink
Date Mon, 29 Apr 2019 13:10:09 GMT
Hi Flavio,

These typos of race conditions are not failure cases, so no exception is
It only means that a single source tasks reads all (or most of the) splits
and no splits are left for the other tasks.
This can be a problem if a record represents a large amount of IO or an
intensive computation as they might not be properly distributed. In that
case you'd need to manually rebalance the partitions of a DataSet.


Am Mo., 29. Apr. 2019 um 14:42 Uhr schrieb Flavio Pompermaier <

> Hi Fabian, I wasn't aware that  "race-conditions may happen if your splits
> are very small as the first data source task might rapidly request and
> process all splits before the other source tasks do their first request".
> What happens exactly when a race-condition arise? Is this exception
> internally handled by Flink or not?
> On Mon, Apr 29, 2019 at 11:51 AM Fabian Hueske <fhueske@gmail.com> wrote:
>> Hi,
>> The method that I described in the SO answer is still implemented in
>> Flink.
>> Flink tries to assign splits to tasks that run on local TMs.
>> However, files are not split per line (this would be horribly
>> inefficient) but in larger chunks depending on the number of subtasks (and
>> in case of HDFS the file block size).
>> Best, Fabian
>> Am So., 28. Apr. 2019 um 18:48 Uhr schrieb Soheil Pourbafrani <
>> soheil.ir08@gmail.com>:
>>> Hi
>>> I want to exactly how Flink read data in the both case of file in local
>>> filesystem and file on distributed file system?
>>> In reading data from local file system I guess every line of the file
>>> will be read by a slot (according to the job parallelism) for applying the
>>> map logic.
>>> In reading from HDFS I read this
>>> <https://stackoverflow.com/a/39153402/8110607> answer by Fabian Hueske
>>> <https://stackoverflow.com/users/3609571/fabian-hueske> and i want to
>>> know is that still the Flink strategy fro reading from distributed system
>>> file?
>>> thanks

View raw message