flink-user mailing list archives

From Till Rohrmann <trohrm...@apache.org>
Subject Re: How to restart/recover on reboot?
Date Tue, 18 Jun 2019 08:30:39 GMT
When a single machine fails you should rather call `taskmanager.sh
start`/`jobmanager.sh start` to start a single process. `start-cluster.sh`
will start multiple processes on different machines.
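
If you want the process to come back by itself after a reboot, a systemd unit per machine is one way to do it. The following is only a rough sketch, not an official Flink setup: the install path `/opt/flink`, the `flink` user, the unit name and the helper function `write_flink_unit` are all assumptions made for illustration. It uses the `start-foreground` argument of `taskmanager.sh`/`jobmanager.sh` so that the JVM stays attached and systemd can supervise it; check that your Flink version's scripts accept that argument.

```shell
#!/usr/bin/env bash
# Sketch: generate a systemd unit that keeps one Flink process running.
# The "flink" user, the unit name and the paths passed in are assumptions,
# not Flink defaults.

write_flink_unit() {
  # $1 = component script name (taskmanager or jobmanager)
  # $2 = Flink install directory
  # $3 = directory to write the unit file into
  local component="$1" flink_home="$2" out_dir="$3"
  cat > "${out_dir}/flink-${component}.service" <<EOF
[Unit]
Description=Apache Flink ${component} (standalone)
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=flink
# start-foreground keeps the JVM attached so systemd can supervise it
ExecStart=${flink_home}/bin/${component}.sh start-foreground
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
}
```

For example, `write_flink_unit taskmanager /opt/flink /etc/systemd/system` (as root) on a worker, followed by `systemctl daemon-reload` and `systemctl enable --now flink-taskmanager`. The same function with `jobmanager` would cover the master. Note that this only restarts the local process; it does not change how Flink itself recovers jobs.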


On Mon, Jun 17, 2019 at 4:30 PM John Smith <java.dev.mtl@gmail.com> wrote:

> Well some reasons, machine reboots/maintenance etc... Host/VM crashes and
> restarts. And same goes for the job manager. I don't want/need to have to
> document/remember some start process for sys admins/devops.
> So far I have looked at ./start-cluster.sh and all it seems to do is SSH
> into all the specified nodes and starts the processes using the jobmanager
> and taskmanager scripts. I don't see anything special in any of the sh
> scripts.
> I configured passwordless ssh through terraform and all that works great
> only when trying to do the manual start through systemd. I may be
> missing something...
> On Mon, 17 Jun 2019 at 09:41, Till Rohrmann <trohrmann@apache.org> wrote:
>> Hi John,
>> I don't have much experience with setting Flink up via systemd services.
>> Why do you want to do it like that?
>> 1. In standalone mode, Flink won't automatically restart TaskManagers.
>> This only works on Yarn and Mesos atm.
>> 2. In case of a lost TaskManager, you should run `taskmanager.sh start`.
>> This script simply starts a new TaskManager process.
>> 3. I guess you could use systemd to bring up a Flink TaskManager process
>> on start up.
>> Cheers,
>> Till
>> On Fri, Jun 14, 2019 at 5:56 PM John Smith <java.dev.mtl@gmail.com>
>> wrote:
>>> I looked into the start-cluster.sh and I don't see anything special. So
>>> technically it should be as easy as installing Systemd services to run
>>> jobmanager.sh and taskmanager.sh respectively?
>>> On Wed, 12 Jun 2019 at 13:02, John Smith <java.dev.mtl@gmail.com> wrote:
>>>> The installation instructions do not indicate how to create systemd
>>>> services.
>>>> 1- When task nodes fail, will the job leader detect this and ssh and
>>>> restart the task node? From my testing it doesn't seem like it.
>>>> 2- How do we recover a lost node? Do we simply go back to the master
>>>> node and run start-cluster.sh and the script is smart enough to figure out
>>>> what is missing?
>>>> 3- Or do we need to create systemd services, and if so, which command
>>>> should the service run?
