flink-issues mailing list archives

From "Chesnay Schepler (Jira)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-20195) Jobs endpoint returns duplicated jobs
Date Mon, 30 Nov 2020 11:15:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-20195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17240680#comment-17240680 ]

Chesnay Schepler commented on FLINK-20195:
------------------------------------------

Is there not a small window in which the job is in a terminal state but the Dispatcher logic
for terminal jobs has not been executed yet?
In {{JobMaster#jobStatusChanged}} the notification to the dispatcher is executed by the {{scheduledExecutorService}},
so it should be possible for a request from the REST API to be processed before that notification arrives.

- job terminates but dispatcher does not know it yet
- REST API queries DispatcherJobs, and receives the terminated job
- Dispatcher is notified, archives job and cleans up DispatcherJob
- REST API queries ExecutionGraphStore, receives the same terminated job
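
The sequence above can be sketched as a small deterministic simulation. This is only an illustration, not Flink's actual code: the two maps stand in for the DispatcherJobs and the ExecutionGraphStore, the {{betweenReads}} hook stands in for the delayed notification to the dispatcher, and every class and method name is hypothetical. The de-duplication by job ID is one possible mitigation shown for illustration, not necessarily the fix adopted for this issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateJobRaceSketch {

    /** Reads the live jobs, then the archived jobs; the hook runs between the two reads. */
    public static List<String> queryBothSources(
            Map<String, String> live, Map<String, String> archived, Runnable betweenReads) {
        List<String> result = new ArrayList<>();
        live.forEach((id, status) -> result.add(id + ":" + status));
        betweenReads.run(); // the dispatcher is notified here, after the first read
        archived.forEach((id, status) -> result.add(id + ":" + status));
        return result;
    }

    /** Keeps one entry per job ID, preserving the order of first appearance. */
    public static List<String> deduplicateById(List<String> entries) {
        Map<String, String> byId = new LinkedHashMap<>();
        for (String entry : entries) {
            String id = entry.substring(0, entry.indexOf(':'));
            byId.put(id, entry);
        }
        return new ArrayList<>(byId.values());
    }

    /** Replays the four steps listed above with a single terminated job. */
    public static List<String> simulateRace() {
        Map<String, String> live = new LinkedHashMap<>();
        Map<String, String> archived = new LinkedHashMap<>();
        // Step 1: the job has terminated, but it is still listed among the live jobs.
        live.put("job-a", "CANCELED");
        // Steps 2-4: read live jobs, archive the job in between, read archived jobs.
        return queryBothSources(
                live, archived, () -> archived.put("job-a", live.remove("job-a")));
    }

    public static void main(String[] args) {
        List<String> raced = simulateRace();
        System.out.println(raced);                  // [job-a:CANCELED, job-a:CANCELED]
        System.out.println(deduplicateById(raced)); // [job-a:CANCELED]
    }
}
```

The duplicate only appears because the job moves from one source to the other between the two reads; reading both sources atomically, or merging the results by job ID, would hide it from the client.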

> Jobs endpoint returns duplicated jobs
> -------------------------------------
>
>                 Key: FLINK-20195
>                 URL: https://issues.apache.org/jira/browse/FLINK-20195
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / REST
>    Affects Versions: 1.11.2
>            Reporter: Ingo Bürk
>            Priority: Minor
>             Fix For: 1.12.0
>
>
> The GET /jobs endpoint can, for a split second, return a duplicated job after it has been canceled. This occurred in Ververica Platform after canceling a job (using PATCH /jobs/{jobId}) and calling GET /jobs.
> I've reproduced this by querying the endpoint in a relatively tight loop (~ every 0.5s) and logging the responses of GET /jobs, and got this:
>  
>  
> {code:java}
> …
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"RUNNING"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELLING"}]}
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"RUNNING"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELLING"}]}
> {"jobs":[{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"}]}
> {"jobs":[{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"}]}
> {"jobs":[{"id":"53fd11db25394308862c997dce9ef990","status":"CANCELED"},{"id":"e110531c08dd4e3dbbfcf7afc1629c3d","status":"FAILED"}]}
> …{code}
>  
> You can see that, for just a moment, the endpoint returned the same job ID twice.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
