flink-user mailing list archives

From "LINZ, Arnaud" <AL...@bouyguestelecom.fr>
Subject RE: Checkpoints and catch-up burst (heavy back pressure)
Date Thu, 28 Feb 2019 14:46:41 GMT
Hi Zhijiang,

Thanks for your feedback.

  *   I’ll try option 1. The timeout is 4 min for now; I’ll switch it to 40 min and will let
you know. Setting it higher than 40 min does not make much sense, since after 40 min the pending
output is already quite large.
  *   Option 3 won’t work. I already use too many resources, and as my source is more
or less an HDFS directory listing, it will always be far faster than any mapper that reads
the file and emits records based on its content, or any sink that stores the transformed data,
unless I put “sleeps” in it (but is this really a good idea?).
  *   Option 2: taskmanager.network.memory.buffers-per-channel and taskmanager.network.memory.buffers-per-gate
are currently unset in my configuration (so they are at their defaults of 2 and 8), but for this
streaming app I have very few exchanges between nodes (just a rebalance after the source that emits
file names; everything else is local to the node). Should I adjust their values nonetheless? To
higher or lower values?
Best,
Arnaud
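
On the “sleeps” idea from the reply above: a minimal sketch of what pacing the source could look like, in plain Java with no Flink dependency. The class `SourceThrottle` and its methods are hypothetical names introduced only for illustration; the thread does not say how the source is implemented, and whether throttling is a good idea at all is left open there.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of Flink: paces a source to a maximum average emission rate. */
final class SourceThrottle {
    private final double maxPerSecond;

    SourceThrottle(double maxPerSecond) {
        if (maxPerSecond <= 0) {
            throw new IllegalArgumentException("maxPerSecond must be positive");
        }
        this.maxPerSecond = maxPerSecond;
    }

    /**
     * Returns how long to pause (in nanoseconds) once `emitted` items have been sent
     * within `elapsedNanos`, so that the average rate stays at or below the limit.
     */
    long pauseNanos(long emitted, long elapsedNanos) {
        // Earliest point in time at which `emitted` items are allowed to have been sent.
        long earliestNanos = (long) Math.ceil(emitted / maxPerSecond * 1_000_000_000L);
        return Math.max(0L, earliestNanos - elapsedNanos);
    }

    /** Blocks the calling thread for the computed pause, if any. */
    void pace(long emitted, long startNanos) throws InterruptedException {
        long pause = pauseNanos(emitted, System.nanoTime() - startNanos);
        if (pause > 0) {
            TimeUnit.NANOSECONDS.sleep(pause);
        }
    }
}
```

A source loop would call `pace(count, start)` after emitting each file name. If the source emits while holding a checkpoint lock, the pause probably belongs outside that lock, since sleeping inside it would delay the barrier even further.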
From: zhijiang <wangzhijiang999@aliyun.com>
Sent: Thursday, 28 February 2019 10:58
To: user <user@flink.apache.org>; LINZ, Arnaud <ALINZ@bouyguestelecom.fr>
Subject: Re: Checkpoints and catch-up burst (heavy back pressure)

Hi Arnaud,

I think there are two key points. First, under heavy back pressure the source may emit the
checkpoint barrier late, because of contention on the synchronization lock.
Second, the barrier is queued behind the in-flight data buffers, so the downstream task has
to process all the buffers ahead of the barrier before it can trigger the checkpoint, and this
takes some time under back pressure.

There are three ways to work around this:
1. Increase the checkpoint timeout so that checkpoints do not expire so quickly.
2. Decrease the network buffer settings to reduce the number of in-flight buffers ahead of the
barrier; check the settings "taskmanager.network.memory.buffers-per-channel" and
"taskmanager.network.memory.buffers-per-gate".
3. Adjust the parallelism, for example by increasing it for the sink vertex so that source
data is processed faster, to avoid back pressure to some extent.

You could check which of these suits your scenario and give it a try.
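
To put option 2 in numbers, here is a back-of-envelope sketch in plain Java. It assumes that an input gate can hold at most `channels * buffers-per-channel + per-gate buffers` on the receiver side, and that each buffer is 32 KiB; both are assumptions for illustration, not taken from the thread, and sender-side buffers are ignored. The class `InFlightEstimate` is a hypothetical name. Note that the per-gate setting appears in the Flink configuration reference as `taskmanager.network.memory.floating-buffers-per-gate`, so the key name is worth checking against the documentation for the Flink version in use.

```java
/** Back-of-envelope estimate; the formula and constants are assumptions, not Flink code. */
final class InFlightEstimate {

    /** Upper bound on receiver-side buffers one input gate can hold (assumed formula). */
    static long buffersPerGate(int channels, int buffersPerChannel, int perGateBuffers) {
        return (long) channels * buffersPerChannel + perGateBuffers;
    }

    /** The same bound expressed in bytes, for a given buffer size. */
    static long bytesPerGate(int channels, int buffersPerChannel, int perGateBuffers,
                             int bufferBytes) {
        return buffersPerGate(channels, buffersPerChannel, perGateBuffers) * bufferBytes;
    }

    /** Seconds the consumer needs to drain `bytes` at a steady throughput. */
    static double drainSeconds(long bytes, double bytesPerSecond) {
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("bytesPerSecond must be positive");
        }
        return bytes / bytesPerSecond;
    }
}
```

With the defaults of 2 and 8 and a hypothetical gate of 4 channels, `bytesPerGate(4, 2, 8, 32 * 1024)` gives 524288 bytes (512 KiB). Under these assumptions, a job with a single rebalance has little data queued in network buffers ahead of a barrier, so lowering the two settings may not shorten checkpoints by much.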

Best,
Zhijiang
------------------------------------------------------------------
From: LINZ, Arnaud <ALINZ@bouyguestelecom.fr>
Sent: Thursday, 28 February 2019, 17:28
To: user <user@flink.apache.org>
Subject: Checkpoints and catch-up burst (heavy back pressure)

Hello,

I have a simple streaming app that gets data from a source and stores it to HDFS using a sink
similar to the bucketing file sink. The checkpointing mode is “exactly once”.
Everything is fine in “normal” operation, as the sink is faster than the source; but when
we stop the application for a while and then restart it, we have a catch-up burst to get all
the messages emitted in the meantime.
During this burst, the source is faster than the sink, and all checkpoints fail (time out)
until the source has totally caught up. This is annoying because the sink does not “commit”
the data before a successful checkpoint is made, so the app releases all the “catch-up”
data as one atomic block, which can be huge if the streaming app was stopped for a while. This
puts unwanted stress on all the downstream Hive jobs that consume the data in micro
batches, and on the Hadoop cluster.

How should I handle the situation? Is there something special to do to get checkpoints even
during heavy load?

The problem does not seem to be new, but I was unable to find any practical solution in the
documentation.

Best regards,
Arnaud




