flink-user mailing list archives

From Shai Kaplan <Shai.Kap...@microsoft.com>
Subject RE: Flink checkpointing gets stuck
Date Tue, 21 Feb 2017 14:37:55 GMT
Hi Vinay.

I couldn't tell from the thread which configuration solved your problem.

I'm using the default predefined option. It may not be the best configuration for my setup
(I'm using Azure DS5_v2 machines), and I honestly haven't given that particular detail much
thought, but I think it should only affect performance, not make the job get completely stuck.

Thanks.

From: vinay patil [mailto:vinay18.patil@gmail.com]
Sent: Tuesday, February 21, 2017 3:58 PM
To: user@flink.apache.org
Subject: Re: Flink checkpointing gets stuck

Hi Shai,

I was facing a similar issue; however, the stream no longer gets stuck midway.
You can refer to this thread for the configuration I used: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Re-Checkpointing-with-RocksDB-as-statebackend-td11752.html

What configuration are you running the job with?
Which RocksDB predefined option are you using?


Regards,
Vinay Patil

On Tue, Feb 21, 2017 at 7:13 PM, Shai Kaplan [via Apache Flink User Mailing List archive.]
<[hidden email]> wrote:
Hi.
I'm running a Flink 1.2 job with a 10-second checkpoint interval. After running for a while
(minutes to hours), Flink fails to save checkpoints and stops processing records (I'm not sure
whether the checkpointing failure is the cause of the problem or just a symptom).
After several checkpoints that take a few seconds each, they start failing due to the 30-minute
timeout.
When I restart one of the Task Manager services (just to get the job restarted), the job is
recovered from the last successful checkpoint (the state size continues to grow, so it's probably
not the reason for the failure), advances somewhat, saves some more checkpoints, and then
enters the failing state again.
On one of the occasions it happened, the first failed checkpoint failed with "Checkpoint Coordinator
is suspending.", which might be an indicator of the cause of the problem, but looking into
Flink's code I can't see how a running job could get into this state.
I am using RocksDB for state, and the state is saved to Azure Blob Store, using the NativeAzureFileSystem
HDFS connector over the wasbs protocol.
Any ideas? Possibly a bug in Flink or RocksDB?
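
For readers following the thread, the setup described above might look roughly like the sketch
below in a Flink 1.2 job. It is only an illustration, not the poster's actual code: the class
name, the container and account names in the wasbs URI, and the checkpoint path are hypothetical
placeholders, and the job's sources, operators and sinks are omitted. It assumes the
flink-statebackend-rocksdb dependency and the Hadoop Azure connector are on the classpath.

```java
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 10 seconds (interval is in milliseconds).
        env.enableCheckpointing(10_000L);

        // Declare a checkpoint failed if it has not completed within 30 minutes.
        env.getCheckpointConfig().setCheckpointTimeout(30L * 60L * 1000L);

        // Hypothetical checkpoint location on Azure Blob Storage over wasbs;
        // "mycontainer" and "myaccount" are placeholders, not values from the thread.
        String checkpointUri =
                "wasbs://mycontainer@myaccount.blob.core.windows.net/flink/checkpoints";

        // RocksDB keeps working state on local disk and writes checkpoints to the URI.
        RocksDBStateBackend backend = new RocksDBStateBackend(checkpointUri);

        // DEFAULT is the "default predefined option" mentioned in the reply above.
        backend.setPredefinedOptions(PredefinedOptions.DEFAULT);

        env.setStateBackend(backend);

        // The job's sources, operators and sinks would be defined here,
        // followed by a call to env.execute(...).
    }
}
```

The other values of the PredefinedOptions enum (SPINNING_DISK_OPTIMIZED,
SPINNING_DISK_OPTIMIZED_HIGH_MEM, FLASH_SSD_OPTIMIZED) can be passed to the same
setPredefinedOptions call; whether any of them changes the behaviour reported here is not
established in this thread.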
