flink-user mailing list archives

From Till Rohrmann <trohrm...@apache.org>
Subject Re: How to restart/recover on reboot?
Date Tue, 18 Jun 2019 20:22:33 GMT
I guess it should work if you install a systemd service that simply
calls `jobmanager.sh start` or `taskmanager.sh start`.
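
As a rough, untested sketch, such a unit for a TaskManager could look like
the following. The install path `/opt/flink`, the `flink` user and the unit
file name are assumptions for illustration and not something the Flink
distribution ships; a JobManager unit would be the same with
`jobmanager.sh` in place of `taskmanager.sh`.

```ini
# /etc/systemd/system/flink-taskmanager.service  (hypothetical file name)
[Unit]
Description=Apache Flink TaskManager (standalone)
# Wait for the network so the process can reach the JobManager.
After=network-online.target
Wants=network-online.target

[Service]
# `taskmanager.sh start` launches the JVM in the background and returns,
# so systemd has to treat the service as a forking daemon.
Type=forking
# Assumed service account and install location; adjust to your setup.
User=flink
ExecStart=/opt/flink/bin/taskmanager.sh start
ExecStop=/opt/flink/bin/taskmanager.sh stop
# A JVM stopped with SIGTERM exits with status 143; count that as a clean stop.
SuccessExitStatus=143
# Bring the process back if it dies unexpectedly.
Restart=on-failure

[Install]
# Start the TaskManager automatically at boot.
WantedBy=multi-user.target
```

After placing the file, `systemctl daemon-reload` followed by
`systemctl enable --now flink-taskmanager` should register the unit and
start it on every boot.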

Cheers,
Till

On Tue, Jun 18, 2019 at 4:29 PM John Smith <java.dev.mtl@gmail.com> wrote:

> Yes, that is understood. But I don't see why we cannot call jobmanager.sh
> and taskmanager.sh to build the cluster and have them run as systemd units.
>
> I looked at start-cluster.sh and all it does is SSH and call jobmanager.sh,
> which then cascades to taskmanager.sh. I just have to pinpoint what's
> missing to get the systemd service working. In fact, calling jobmanager.sh
> as a systemd service actually sees the shared masters, slaves and
> flink-conf.yaml. But it binds to localhost.
>
> Maybe one way to do it would be to bootstrap the cluster with
> ./start-cluster.sh and then install systemd services for jobmanager.sh and
> taskmanager.sh.
>
> Like I said, I don't want to have some process in place to remind admins
> that they need to manually start a node every time they patch or a host
> goes down for whatever reason.
>
> On Tue, 18 Jun 2019 at 04:31, Till Rohrmann <trohrmann@apache.org> wrote:
>
>> When a single machine fails you should rather call `taskmanager.sh
>> start`/`jobmanager.sh start` to start a single process. `start-cluster.sh`
>> will start multiple processes on different machines.
>>
>> Cheers,
>> Till
>>
>> On Mon, Jun 17, 2019 at 4:30 PM John Smith <java.dev.mtl@gmail.com>
>> wrote:
>>
>>> Well some reasons, machine reboots/maintenance etc... Host/VM crashes
>>> and restarts. And same goes for the job manager. I don't want/need to have
>>> to document/remember some start process for sys admins/devops.
>>>
>>> So far I have looked at ./start-cluster.sh and all it seems to do is SSH
>>> into all the specified nodes and starts the processes using the jobmanager
>>> and taskmanager scripts. I don't see anything special in any of the sh
>>> scripts.
>>> I configured passwordless ssh through terraform and all that works great.
>>> It is only when trying to do the manual start through systemd that I may
>>> have something missing...
>>>
>>>
>>>
>>> On Mon, 17 Jun 2019 at 09:41, Till Rohrmann <trohrmann@apache.org>
>>> wrote:
>>>
>>>> Hi John,
>>>>
>>>> I have not much experience wrt setting Flink up via systemd services.
>>>> Why do you want to do it like that?
>>>>
>>>> 1. In standalone mode, Flink won't automatically restart TaskManagers.
>>>> This only works on Yarn and Mesos atm.
>>>> 2. In case of a lost TaskManager, you should run `taskmanager.sh
>>>> start`. This script simply starts a new TaskManager process.
>>>> 3. I guess you could use systemd to bring up a Flink TaskManager
>>>> process on start up.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Fri, Jun 14, 2019 at 5:56 PM John Smith <java.dev.mtl@gmail.com>
>>>> wrote:
>>>>
>>>>> I looked into the start-cluster.sh and I don't see anything special.
>>>>> So technically it should be as easy as installing systemd services to
>>>>> run jobmanager.sh and taskmanager.sh respectively?
>>>>>
>>>>> On Wed, 12 Jun 2019 at 13:02, John Smith <java.dev.mtl@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The installation instructions do not indicate how to create systemd
>>>>>> services.
>>>>>>
>>>>>> 1- When task nodes fail, will the job leader detect this and ssh and
>>>>>> restart the task node? From my testing it doesn't seem like it.
>>>>>> 2- How do we recover a lost node? Do we simply go back to the master
>>>>>> node and run start-cluster.sh and the script is smart enough to figure
>>>>>> out what is missing?
>>>>>> 3- Or do we need to create systemd services and, if so, which
>>>>>> command should the service run?
>>>>>>
>>>>>
