flink-user mailing list archives

From John Smith <java.dev....@gmail.com>
Subject Re: [EXTERNAL] Re: How to restart/recover on reboot?
Date Wed, 19 Jun 2019 15:59:06 GMT
OK, I tried it and it works! I can set up my cluster with Terraform and enable
systemd services! I think I got confused when I looked and it was doing
leader election, because all the services came up quickly!
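
For anyone finding this thread later, here is a minimal sketch of what such systemd units could look like, written as a small script that renders the unit text. It is not taken from the Flink docs. The install path /opt/flink, the `flink` user, the unit file names and the hostname jm1.example.internal are all assumptions for illustration. It also assumes the positional `jobmanager.sh start <host>` form that the help text mentioned in this thread seems to describe, so check the usage message of your Flink version first.

```shell
#!/bin/sh
# Sketch: render systemd units for Flink's jobmanager.sh / taskmanager.sh.
# Assumptions (not from the Flink docs): install path, "flink" user, unit names.
set -eu

# render_flink_unit ROLE FLINK_HOME [BIND_HOST]
#   ROLE is "jobmanager" or "taskmanager".
#   BIND_HOST is optional; when given it is appended to the start command so
#   the process binds to that host instead of the flink-conf.yaml default.
render_flink_unit() {
  role="$1"
  flink_home="$2"
  bind_host="${3:-}"
  cat <<EOF
[Unit]
Description=Apache Flink ${role}
After=network-online.target
Wants=network-online.target

[Service]
# The start command daemonizes the JVM and returns, hence Type=forking.
Type=forking
User=flink
ExecStart=${flink_home}/bin/${role}.sh start${bind_host:+ ${bind_host}}
ExecStop=${flink_home}/bin/${role}.sh stop
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Write into a scratch directory so the sketch is safe to run anywhere;
# on a real host the files would go under /etc/systemd/system/.
out_dir="$(mktemp -d)"
render_flink_unit jobmanager /opt/flink jm1.example.internal \
  > "${out_dir}/flink-jobmanager.service"
render_flink_unit taskmanager /opt/flink \
  > "${out_dir}/flink-taskmanager.service"
cat "${out_dir}/flink-jobmanager.service"
```

On a real node the rendered file would be copied to /etc/systemd/system/, followed by `systemctl daemon-reload` and `systemctl enable --now flink-jobmanager.service`, so that the process comes back after a reboot without anyone running start-cluster.sh by hand.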



On Tue, 18 Jun 2019 at 22:35, John Smith <java.dev.mtl@gmail.com> wrote:

> Ah OK, we need to pass --host. If I recall, the command-line help says
> jobmanager.sh <host>?! I have to go check tomorrow...
>
> On Tue., Jun. 18, 2019, 10:05 p.m. PoolakkalMukkath, Shakir, <
> Shakir_PoolakkalMukkath@comcast.com> wrote:
>
>> Hi Nick,
>>
>>
>>
>> It works that way by explicitly setting --host. I got misled by the
>> *“only”* word in the doc and did not try. Thanks for the help.
>>
>>
>>
>> Thanks,
>>
>> Shakir
>>
>> *From: *"Martin, Nick" <Nick.Martin@ngc.com>
>> *Date: *Tuesday, June 18, 2019 at 6:31 PM
>> *To: *"PoolakkalMukkath, Shakir" <Shakir_PoolakkalMukkath@comcast.com>,
>> Till Rohrmann <trohrmann@apache.org>, John Smith <java.dev.mtl@gmail.com>
>> *Cc: *user <user@flink.apache.org>
>> *Subject: *RE: [EXTERNAL] Re: How to restart/recover on reboot?
>>
>>
>>
>> jobmanager.sh takes an optional argument for the hostname to bind to, and
>> start-cluster.sh uses it. If you leave it blank, the script will use
>> whatever is in flink-conf.yaml (localhost is the default value that ships
>> with Flink).
>>
>>
>>
>> The dockerized version of flink runs pretty much the way you’re trying to
>> operate (i.e. each node starts itself), so the entrypoint script out of
>> that is probably a good source of information about how to set it up.
>>
>>
>>
>> *From:* PoolakkalMukkath, Shakir [mailto:
>> Shakir_PoolakkalMukkath@comcast.com]
>> *Sent:* Tuesday, June 18, 2019 2:15 PM
>> *To:* Till Rohrmann <trohrmann@apache.org>; John Smith <
>> java.dev.mtl@gmail.com>
>> *Cc:* user <user@flink.apache.org>
>> *Subject:* EXT :Re: [EXTERNAL] Re: How to restart/recover on reboot?
>>
>>
>>
>> Hi Till, John,
>>
>>
>>
>> I do agree with the issue John mentioned and have the same problem.
>>
>>
>>
>> We can only *start* a standalone HA cluster with the ./start-cluster.sh
>> script. Then, when there are failures, we can *restart* the affected
>> components individually by calling jobmanager.sh/taskmanager.sh. This
>> works great.
>>
>> But, like John mentioned, if we want to start the cluster initially
>> by running jobmanager.sh on each JobManager node, it does not
>> work. It binds to localhost and does not form the HA cluster.
>>
>>
>>
>> Thanks,
>>
>> Shakir
>>
>>
>>
>> *From: *Till Rohrmann <trohrmann@apache.org>
>> *Date: *Tuesday, June 18, 2019 at 4:23 PM
>> *To: *John Smith <java.dev.mtl@gmail.com>
>> *Cc: *user <user@flink.apache.org>
>> *Subject: *[EXTERNAL] Re: How to restart/recover on reboot?
>>
>>
>>
>> I guess it should work if you installed a systemd service which simply
>> calls `jobmanager.sh start` or `taskmanager.sh start`.
>>
>>
>>
>> Cheers,
>>
>> Till
>>
>>
>>
>> On Tue, Jun 18, 2019 at 4:29 PM John Smith <java.dev.mtl@gmail.com>
>> wrote:
>>
>> Yes, that is understood. But I don't see why we cannot call jobmanager.sh
>> and taskmanager.sh to build the cluster and have them run as systemd units.
>>
>> I looked at start-cluster.sh and all it does is SSH and call
>> jobmanager.sh, which then cascades to taskmanager.sh. I just have to
>> pinpoint what's missing to get the systemd service working. In fact, calling
>> jobmanager.sh as a systemd service actually sees the shared masters, slaves
>> and flink-conf.yaml. But it binds to localhost.
>>
>>
>>
>> Maybe one way to do it would be to bootstrap the cluster with
>> ./start-cluster.sh and then install systemd services for jobmanager.sh and
>> taskmanager.sh.
>>
>>
>>
>> Like I said, I don't want to have some process in place to remind admins
>> that they need to manually start a node every time they patch or a host
>> goes down for whatever reason.
>>
>>
>>
>> On Tue, 18 Jun 2019 at 04:31, Till Rohrmann <trohrmann@apache.org> wrote:
>>
>> When a single machine fails you should rather call `taskmanager.sh
>> start`/`jobmanager.sh start` to start a single process. `start-cluster.sh`
>> will start multiple processes on different machines.
>>
>>
>>
>> Cheers,
>>
>> Till
>>
>>
>>
>> On Mon, Jun 17, 2019 at 4:30 PM John Smith <java.dev.mtl@gmail.com>
>> wrote:
>>
>> Well, some reasons: machine reboots/maintenance, etc. Host/VM crashes and
>> restarts. And the same goes for the job manager. I don't want/need to have
>> to document/remember some start process for sys admins/devops.
>>
>> So far I have looked at ./start-cluster.sh and all it seems to do is SSH
>> into all the specified nodes and start the processes using the jobmanager
>> and taskmanager scripts. I don't see anything special in any of the sh
>> scripts.
>> I configured passwordless SSH through Terraform and all that works great;
>> it is only when trying to do the manual start through systemd that I may
>> have something missing...
>>
>>
>>
>> On Mon, 17 Jun 2019 at 09:41, Till Rohrmann <trohrmann@apache.org> wrote:
>>
>> Hi John,
>>
>>
>>
>> I don't have much experience with setting Flink up via systemd services.
>> Why do you want to do it like that?
>>
>>
>>
>> 1. In standalone mode, Flink won't automatically restart TaskManagers.
>> This only works on Yarn and Mesos atm.
>>
>> 2. In case of a lost TaskManager, you should run `taskmanager.sh start`.
>> This script simply starts a new TaskManager process.
>>
>> 3. I guess you could use systemd to bring up a Flink TaskManager process
>> on start up.
>>
>>
>>
>> Cheers,
>>
>> Till
>>
>>
>>
>> On Fri, Jun 14, 2019 at 5:56 PM John Smith <java.dev.mtl@gmail.com>
>> wrote:
>>
>> I looked into start-cluster.sh and I don't see anything special. So
>> technically it should be as easy as installing systemd services to run
>> jobmanager.sh and taskmanager.sh respectively?
>>
>>
>>
>> On Wed, 12 Jun 2019 at 13:02, John Smith <java.dev.mtl@gmail.com> wrote:
>>
>> The installation instructions do not indicate how to create systemd
>> services.
>>
>>
>>
>> 1- When task nodes fail, will the job leader detect this and ssh and
>> restart the task node? From my testing it doesn't seem like it.
>>
>> 2- How do we recover a lost node? Do we simply go back to the master node
>> and run start-cluster.sh and the script is smart enough to figure out what
>> is missing?
>>
>> 3- Or do we need to create systemd services, and if so, which command do
>> we start the service with?
>>
>>
>>
>
