flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-3011) Cannot cancel failing/restarting streaming job from the command line
Date Tue, 17 Nov 2015 14:22:11 GMT

    [ https://issues.apache.org/jira/browse/FLINK-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15008719#comment-15008719 ]

ASF GitHub Bot commented on FLINK-3011:

GitHub user uce opened a pull request:


    [FLINK-3011, 3019, 3028] Cancel jobs in RESTARTING state

    This addresses issues with cancelling jobs that are in the `RESTARTING` state. A job enters this state after a failure as soon as all job vertices are in their final state. It then stays in this state until it is redeployed (currently after 100s by default). While in this state, the job cannot be cancelled. If the failure is permanent (for example, missing slots), the job can never be cancelled.
    This PR includes changes to the ExecutionGraph and to the clients:
    **ExecutionGraph** (FLINK-3011)
    - Remove the state transition from `FAILED` to `RESTARTING` in `restart()`. This was breaking
the semantics of `FAILED` being a terminal state. It was only relevant for a test as far as
I can tell.
    - When cancelling during restarts, two job states are relevant:
      - `RESTARTING`: try to set the state directly to `CANCELED`, as all vertices have already failed when the job enters the `RESTARTING` state. If the state transition to `CANCELED` succeeds, the restart will be ignored with a log message.
      - `FAILING`: try to set the state to `CANCELLING` and wait for the failing of the vertices to finish. This will finish the cancellation as usual in `jobVertexInFinalState()`.
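    As a rough illustration of the two cancel paths above, here is a minimal Java sketch. It is not the actual `ExecutionGraph` code: the class `JobStateSketch`, its `allVerticesInFinalState()` method (standing in for `jobVertexInFinalState()`), and the reduced set of states are assumptions made for illustration only.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical, simplified model of the job-level cancel logic (not Flink's ExecutionGraph). */
class JobStateSketch {

    /** Reduced set of job states; assumed for this sketch. */
    enum JobStatus { CREATED, RUNNING, FAILING, RESTARTING, CANCELLING, CANCELED, FAILED }

    private final AtomicReference<JobStatus> state;

    JobStateSketch(JobStatus initial) {
        this.state = new AtomicReference<>(initial);
    }

    JobStatus getState() {
        return state.get();
    }

    /** Returns true if the cancel request took effect, false if the job is already terminal. */
    boolean cancel() {
        while (true) {
            JobStatus current = state.get();
            switch (current) {
                case RESTARTING:
                    // All vertices have already failed, so go straight to CANCELED.
                    if (state.compareAndSet(current, JobStatus.CANCELED)) {
                        return true;
                    }
                    break; // lost a race with another transition: re-read and retry
                case CREATED:
                case RUNNING:
                case FAILING:
                    // Vertices may still be winding down; finish in allVerticesInFinalState().
                    if (state.compareAndSet(current, JobStatus.CANCELLING)) {
                        return true;
                    }
                    break; // lost a race: re-read and retry
                default:
                    // CANCELLING, CANCELED, FAILED: nothing to do.
                    return false;
            }
        }
    }

    /** Stand-in for jobVertexInFinalState() once every vertex has reached a final state. */
    void allVerticesInFinalState() {
        if (state.compareAndSet(JobStatus.CANCELLING, JobStatus.CANCELED)) {
            return;
        }
        state.compareAndSet(JobStatus.FAILING, JobStatus.RESTARTING);
    }

    /** Returns true if the restart proceeds; false if it is ignored (e.g. job was cancelled). */
    boolean restart() {
        // FAILED is terminal: only a job that is still RESTARTING may be redeployed.
        return state.compareAndSet(JobStatus.RESTARTING, JobStatus.CREATED);
    }
}
```

    In this sketch, a `cancel()` that arrives while the job is `RESTARTING` wins the compare-and-set, so the later `restart()` call sees `CANCELED` and is ignored.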
    When reviewing the `cancel()`, `jobVertexInFinalState()`, and `restart()` methods are
    **CLIFrontend** (FLINK-3019)
    - List restarting jobs with scheduled jobs
    $ bin/flink list
    No running jobs.
    ---------------- Scheduled/Restarting Jobs -------------------
    17.11.2015 15:14:01 : 4b3fa06c88e5a2a4963241e7afca7b7d : Streaming WordCount (RESTARTING)
    **WebFrontend** (FLINK-3028)
    - Show the cancel button if the job is restarting. Previously, it was only displayed for running or created jobs.
    I want to merge this for 0.10.1 and 1.0.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uce/flink 3011-restart

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1369

commit 0c5a3306808bec5b9a833703adbcd9f45bbe6de5
Author: Ufuk Celebi <uce@apache.org>
Date:   2015-11-16T15:18:20Z

    [FLINK-3011] [runtime] Disallow ExecutionGraph state transition from FAILED to RESTARTING
    Removes the possibility to go from FAILED state back to RESTARTING. This was only used in a test
    case. It was breaking the terminal state semantics of the FAILED state.

commit 19c602b2ce7686237d8611645a4662aa2b2a0cef
Author: Ufuk Celebi <uce@apache.org>
Date:   2015-11-17T10:40:54Z

    [FLINK-3011] [runtime, tests] Translate ExecutionGraphRestartTest to Java

commit e13dd1bac7029af6ae4157af226131a10f5d02d0
Author: Ufuk Celebi <uce@apache.org>
Date:   2015-11-17T10:56:42Z

    [FLINK-3011] [runtime] Fix cancel during restart

commit 657e34f31fe9c6325900f42c36257b5c5d2019be
Author: Ufuk Celebi <uce@apache.org>
Date:   2015-11-17T13:11:44Z

    [FLINK-3019] [client] List restarting jobs with scheduled jobs

commit 8b2850610aff1197d204bdb7d790df8fb6b5df4c
Author: Ufuk Celebi <uce@apache.org>
Date:   2015-11-17T13:51:15Z

    [FLINK-3028] [runtime-web] Show cancel button for restarting jobs


> Cannot cancel failing/restarting streaming job from the command line
> --------------------------------------------------------------------
>                 Key: FLINK-3011
>                 URL: https://issues.apache.org/jira/browse/FLINK-3011
>             Project: Flink
>          Issue Type: Bug
>          Components: Command-line client
>    Affects Versions: 0.10.0, 1.0.0
>            Reporter: Gyula Fora
>            Assignee: Ufuk Celebi
>            Priority: Critical
> I cannot seem to be able to cancel a failing/restarting job from the command line client. The job cannot be rescheduled so it keeps failing:
> The exception I get:
> 13:58:11,240 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 0c895d22c632de5dfe16c42a9ba818d5 (player-id) changed to RESTARTING.
> 13:58:25,234 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Trying to cancel job with ID 0c895d22c632de5dfe16c42a9ba818d5.
> 13:58:25,561 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

This message was sent by Atlassian JIRA
