flink-user mailing list archives

From Taher Koitawala <taher.koitaw...@gslab.com>
Subject Re: How does flink read a DataSet?
Date Wed, 12 Sep 2018 11:30:37 GMT
So Flink TMs read one line at a time from HDFS in parallel, keep filling
records into memory, and keep passing them to the next operator? I just
want to know how data comes into memory and how it is partitioned between
TMs. Is there documentation I can refer to on how the reading is done and
how data is pushed from operator to operator in both stream and batch?
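[Editor's note: the parallel, split-based reading asked about here can be sketched as follows. This is a conceptual illustration, not Flink's actual source code; the function names (`make_splits`, `read_split`) are hypothetical. The idea is that a file is divided into byte-range "input splits", and each parallel source task reads only its own split, forwarding records downstream one at a time rather than loading the whole dataset first.]

```python
# Conceptual sketch (NOT Flink internals): a file is divided into input
# splits, and each parallel source task reads its split independently,
# emitting records to the next operator as soon as they are read.

def make_splits(num_records, parallelism):
    """Divide [0, num_records) into roughly equal ranges, one per task."""
    step = num_records // parallelism
    splits = []
    for i in range(parallelism):
        start = i * step
        end = num_records if i == parallelism - 1 else (i + 1) * step
        splits.append((start, end))
    return splits

def read_split(lines, split):
    """Simulate one source task: emit only the records in its split."""
    start, end = split
    for offset, line in enumerate(lines):
        if start <= offset < end:
            yield line  # handed to the next operator immediately

lines = [f"record-{i}" for i in range(10)]
splits = make_splits(len(lines), parallelism=3)
out = [rec for s in splits for rec in read_split(lines, s)]
assert sorted(out) == sorted(lines)  # every record is read exactly once
```

Each split is processed by a different TaskManager slot, which is how the data ends up partitioned across TMs without any single node holding the whole input.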

On Wed 12 Sep, 2018, 4:28 PM Fabian Hueske, <fhueske@gmail.com> wrote:

> Actually, some parts of Flink's batch engine are similar to streaming as
> well. If the data does not need to be sorted or put into a hash table, the
> data is pipelined (as in many relational database systems).
> For example, if you have a job that joins two inputs with a HashJoin, only
> the build side is materialized in memory. If the build side fits in
> memory, the probe side is fully pipelined. If some parts of the build side
> need to be put on disk, the fraction of the probe side that would join
> with the spilled part is also written to disk. If the data needs to be
> sorted, Flink tries to do that in memory as well but can spill to disk if
> necessary. A job that only applies a filter or a simple transformation is
> also fully pipelined.
>
> So it depends on the job and its execution plan whether data is stored in
> memory or not.
>
> Best, Fabian
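[Editor's note: the hybrid hash-join behavior Fabian describes can be sketched as below. This is a conceptual illustration in Python, not Flink's actual implementation; partition counts and the in-memory limit are made-up parameters. Build-side partitions that fit in memory serve probe records immediately (pipelined); the probe fraction belonging to spilled partitions is set aside and joined in a second pass.]

```python
# Conceptual sketch (NOT Flink internals): hybrid hash join. The build
# side is partitioned; partitions that fit stay in memory, the rest are
# treated as "spilled". Probe records that hit in-memory partitions are
# joined and emitted immediately; probe records for spilled partitions
# are buffered (written to disk in a real engine) and joined later.
from collections import defaultdict

def hybrid_hash_join(build, probe, num_partitions=4, in_memory_limit=2):
    part = lambda key: hash(key) % num_partitions
    build_parts = defaultdict(list)
    for key, value in build:
        build_parts[part(key)].append((key, value))

    # Keep the smallest partitions in memory up to the limit; "spill" the rest.
    by_size = sorted(build_parts, key=lambda p: len(build_parts[p]))
    in_mem = set(by_size[:in_memory_limit])
    spilled_probe = defaultdict(list)
    results = []

    # First pass: the probe side streams through, pipelined for in-memory parts.
    for key, value in probe:
        p = part(key)
        if p in in_mem:
            for bk, bv in build_parts[p]:
                if bk == key:
                    results.append((key, bv, value))
        else:
            spilled_probe[p].append((key, value))  # deferred to disk

    # Second pass: join the spilled probe fraction, partition by partition.
    for p, probes in spilled_probe.items():
        index = defaultdict(list)
        for bk, bv in build_parts[p]:
            index[bk].append(bv)
        for key, value in probes:
            for bv in index[key]:
                results.append((key, bv, value))
    return results

build = [(1, "a"), (2, "b"), (1, "c")]
probe = [(1, "x"), (2, "y"), (3, "z")]
assert sorted(hybrid_hash_join(build, probe)) == [
    (1, "a", "x"), (1, "c", "x"), (2, "b", "y")]
```

Note how only the probe fraction whose partition was spilled loses the pipelining; the rest of the probe side never waits.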
>
> 2018-09-12 2:34 GMT-04:00 vino yang <yanghua1127@gmail.com>:
>
>> Hi Taher,
>>
>> Stream processing and batch processing are very different. The nature
>> of batch processing means that it must operate on bulk data, e.g. for
>> memory-based sorting, joins, and so on. In those cases it needs to wait
>> for the relevant data to arrive before it can compute, but this does not
>> mean that the data is concentrated on one node; the computation is
>> still distributed. Flink applies corresponding optimizations to the
>> execution plan of a batch job. For storing large data sets, it uses a
>> custom managed-memory mechanism (which can be extended with off-heap
>> memory). The data is kept in this managed memory as far as possible and
>> is spilled to disk when it does not fit.
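[Editor's note: the "keep it in memory, spill when it does not fit" pattern mentioned above is the classic external-sort strategy, sketched here conceptually. This is not Flink's managed-memory code; `memory_budget` is a made-up stand-in for the fixed pool of memory segments a real operator would own.]

```python
# Conceptual sketch (NOT Flink's managed memory): sort a dataset under a
# fixed memory budget by sorting bounded in-memory runs, "spilling" each
# full run (a sorted file on disk in a real engine), then merge-reading.
import heapq

def external_sort(records, memory_budget):
    runs = []   # each run stands in for a spilled, sorted file
    buf = []
    for r in records:
        buf.append(r)
        if len(buf) == memory_budget:   # buffer full -> sort and spill
            runs.append(sorted(buf))
            buf = []
    if buf:
        runs.append(sorted(buf))
    # k-way merge of the sorted runs, reading each one sequentially
    return list(heapq.merge(*runs))

data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
assert external_sort(data, memory_budget=3) == sorted(data)
```

The merge phase only ever holds one record per run in memory, which is why the total dataset size can far exceed the budget.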
>>
>> Regarding fault tolerance, the current checkpoint mechanism only
>> applies to stream processing. Batch fault tolerance is achieved by
>> re-execution, i.e. by replaying the complete data set. If a TaskManager
>> fails, Flink kicks it out of the cluster and the tasks running on it
>> fail, but the consequences of a task failure differ between stream and
>> batch processing: for stream processing it triggers a restart of the
>> entire job, while for batch processing it may trigger only a partial
>> restart.
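[Editor's note: the full-restart vs. partial-restart distinction can be sketched as follows. This is a simplified model, not Flink's scheduler; the function and parameter names are hypothetical. The intuition: when stages are connected by live pipelined channels (streaming), a failure propagates through all of them, whereas a batch job with a materialized intermediate result can re-read that result and re-run only the stages after it.]

```python
# Conceptual sketch (NOT Flink's failover code): which stages restart
# after a failure. pipelined[i] is True if stage i feeds stage i+1 over
# a live pipelined channel (streaming-style), False if its output is
# materialized (batch-style blocking exchange) and can be re-read.

def tasks_to_restart(num_stages, failed, pipelined):
    restart = {failed}
    # Walk upstream: stop at a materialized result, since it can be re-read.
    i = failed
    while i > 0 and pipelined[i - 1]:
        i -= 1
        restart.add(i)
    # All downstream consumers of the failed stage must also restart.
    for j in range(failed + 1, num_stages):
        restart.add(j)
    return restart

# Streaming: every channel is pipelined -> the whole job restarts.
assert tasks_to_restart(3, failed=2, pipelined=[True, True]) == {0, 1, 2}
# Batch with a materialized shuffle before stage 2 -> only stage 2 re-runs.
assert tasks_to_restart(3, failed=2, pipelined=[True, False]) == {2}
```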
>>
>> Thanks, vino.
>>
>> Taher Koitawala <taher.koitawala@gslab.com> wrote on Wed, Sep 12, 2018 at 1:50 AM:
>>
>>> Furthermore, how does Flink deal with TaskManagers dying while it is
>>> using the DataSet API? Is checkpointing done on a DataSet too, or does
>>> the whole dataset have to be re-read?
>>>
>>> Regards,
>>> Taher Koitawala
>>> GS Lab Pune
>>> +91 8407979163
>>>
>>> On Tue, Sep 11, 2018 at 11:18 PM, Taher Koitawala <
>>> taher.koitawala@gslab.com> wrote:
>>>
>>>> Hi All,
>>>>          Just like Spark does Flink read a dataset and keep it in
>>>> memory and keep applying transformations? Or all records read by Flink
>>>> async parallel reads? Furthermore, how does Flink deal with
>>>>
>>>> Regards,
>>>> Taher Koitawala
>>>> GS Lab Pune
>>>> +91 8407979163
>>>>
>>>
>>>
>
