flink-user mailing list archives

From Yun Tang <myas...@live.com>
Subject Re: Inconsistent Checkpoint API response
Date Tue, 26 May 2020 02:44:42 GMT
Hi Bhaskar

It seems I still don't fully understand your case 5. Your job failed 6 times and recovered
from a previous checkpoint to restart again. However, you found that the REST API gave the
wrong answer.
How do you know that the "restored" field is reporting a wrong checkpoint file which is not the latest?
Have you checked the JM log for the message "Restoring job xxx from latest
valid checkpoint: x@xxxx" [1] to see exactly which checkpoint was chosen for the restore?

I think you could give a more concrete example, e.g. which checkpoint you expected to be restored
versus which one actually was, to illustrate the problem.

[1] https://github.com/apache/flink/blob/8f992e8e868b846cf7fe8de23923358fc6b50721/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1250

Best
Yun Tang
________________________________
From: Vijay Bhaskar <bhaskar.ebay77@gmail.com>
Sent: Monday, May 25, 2020 17:01
To: Yun Tang <myasuka@live.com>
Cc: user <user@flink.apache.org>
Subject: Re: Inconsistent Checkpoint API response

Thanks Yun.
Here is the problem I am facing:

I am using the jobs/:jobID/checkpoints API to recover the failed job. We have a remote manager
which monitors the jobs. We use the "restored" field of the API response to get the latest
checkpoint file to use. It gives the correct checkpoint file in the first four cases, but in the
fifth case the "restored" field gives a wrong checkpoint file which is not the latest.
When we compare it with the checkpoint file returned by the "completed" field, both give
identical checkpoints in the first four cases, but not in the fifth case.
We can't use the Flink UI because of security reasons.
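
For illustration, here is a minimal sketch of the lookup described above (Python, standard
library only; the JobManager address and job id are placeholders, and the field names follow
the sample response later in this thread):

import json
import urllib.request

# Placeholder JobManager REST address and job id; adjust for your cluster.
JOBMANAGER = "http://localhost:8081"
JOB_ID = "29ae7600aa4f7d53a0dc1a0a7b257c85"

# jobs/:jobid/checkpoints returns the checkpoint statistics of the job.
with urllib.request.urlopen(f"{JOBMANAGER}/jobs/{JOB_ID}/checkpoints") as resp:
    body = json.load(resp)

latest = body["latest"]
restored = latest.get("restored")      # may be null before any restore has happened
completed = latest.get("completed")    # latest successfully completed checkpoint

print("restored :", restored["external_path"] if restored else None)
print("completed:", completed["external_path"] if completed else None)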

Regards
Bhaskar

On Mon, May 25, 2020 at 12:57 PM Yun Tang <myasuka@live.com> wrote:
Hi Vijay

If I understand correctly, do you mean that the last "restored" checkpoint is null in the REST API
when the job has failed 6 times and then recovered successfully, with several more successful
checkpoints afterwards?

First of all, if your job has just recovered successfully, can you observe the "last restored"
checkpoint in the web UI?
Secondly, for how long can you not see the "restored" field after a successful recovery?
Last but not least, I cannot see the real difference among your cases; what is the core difference
in your case (5)?

From the implementation of Flink, it creates the checkpoint statistics without a restored
checkpoint and assigns the restored checkpoint once the latest savepoint/checkpoint has been
restored. [1]

[1] https://github.com/apache/flink/blob/50253c6b89e3c92cac23edda6556770a63643c90/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1285
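
In other words, a REST client should expect "restored" to be null until a restore has actually
happened. A minimal defensive check might look like this (Python; "body" is assumed to hold the
parsed JSON of jobs/:jobid/checkpoints):

# "body" is assumed to be the parsed response of jobs/:jobid/checkpoints.
restored = body.get("latest", {}).get("restored")
if restored is None:
    # No restore has been recorded yet (e.g. a fresh submission);
    # only the "completed" entry is meaningful at this point.
    restored_path = None
else:
    restored_path = restored["external_path"]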

Best
Yun Tang

________________________________
From: Vijay Bhaskar <bhaskar.ebay77@gmail.com>
Sent: Monday, May 25, 2020 14:20
To: user <user@flink.apache.org>
Subject: Inconsistent Checkpoint API response

Hi
I am using Flink retained checkpoints, along with the jobs/:jobid/checkpoints API, to retrieve
the latest retained checkpoint.
A sample response of the Flink checkpoints API is included at the end of this mail.

My job's restart attempts are set to 5.
In the "latest" key of the checkpoint API response, the checkpoint file names of the "restored"
and "completed" values show the following behavior:
1) Suppose the job fails 3 times and recovers on the 4th attempt; then both values are the same.
2) Suppose the job fails 4 times and recovers on the 5th attempt; then both values are the same.
3) Suppose the job fails 5 times and recovers on the 6th attempt; then both values are the same.
4) Suppose the job fails all 6 times and is marked failed; then also both values are the same.
5) Suppose the job fails a 6th time after recovering from 5 attempts and making a few checkpoints;
then the two values are different.

In cases (1), (2), (3) and (4) I never had any issue. Only in case (5) did I have a severe issue
in production, as the checkpoint in the "restored" field doesn't exist.
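
To make the comparison in the cases above concrete, this is roughly the check being described
(Python; "body" is assumed to be the parsed JSON shown at the end of this mail):

latest = body["latest"]
restored = latest.get("restored")
completed = latest.get("completed")

# Cases (1)-(4): the two external paths agree; in case (5) they differ.
same = (
    restored is not None
    and completed is not None
    and restored["external_path"] == completed["external_path"]
)
print("restored and completed point at the same checkpoint file:", same)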

Please suggest.



{
   "counts":{
      "restored":6,
      "total":3,
      "in_progress":0,
      "completed":3,
      "failed":0
   },
   "summary":{
      "state_size":{
         "min":4879,
         "max":4879,
         "avg":4879
      },
      "end_to_end_duration":{
         "min":25,
         "max":130,
         "avg":87
      },
      "alignment_buffered":{
         "min":0,
         "max":0,
         "avg":0
      }
   },
   "latest":{
      "completed":{
         "@class":"completed",
         "id":7094,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382502772,
         "latest_ack_timestamp":1590382502902,
         "state_size":4879,
         "end_to_end_duration":130,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
         "discarded":false
      },
      "savepoint":null,
      "failed":null,
      "restored":{
         "id":7093,
         "restore_timestamp":1590382478448,
         "is_savepoint":false,
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093"
      }
   },
   "history":[
      {
         "@class":"completed",
         "id":7094,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382502772,
         "latest_ack_timestamp":1590382502902,
         "state_size":4879,
         "end_to_end_duration":130,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7094",
         "discarded":false
      },
      {
         "@class":"completed",
         "id":7093,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382310195,
         "latest_ack_timestamp":1590382310220,
         "state_size":4879,
         "end_to_end_duration":25,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7093",
         "discarded":false
      },
      {
         "@class":"completed",
         "id":7092,
         "status":"COMPLETED",
         "is_savepoint":false,
         "trigger_timestamp":1590382190195,
         "latest_ack_timestamp":1590382190303,
         "state_size":4879,
         "end_to_end_duration":108,
         "alignment_buffered":0,
         "num_subtasks":2,
         "num_acknowledged_subtasks":2,
         "tasks":{

         },
         "external_path":"file:/var/lib/persist/flink/checkpoints/29ae7600aa4f7d53a0dc1a0a7b257c85/chk-7092",
         "discarded":true
      }
   ]
}

