flink-user mailing list archives

From ashish pok <ashish...@yahoo.com>
Subject Re: Permissions to delete Checkpoint on cancel
Date Mon, 23 Jul 2018 12:18:22 GMT
Just a follow-up: in the absence of a NAS, is the best option here to keep both checkpoints and
savepoints on HDFS and have the StateBackend use local SSDs?
We were trying to avoid hitting HDFS at all, other than for savepoints.

- Ashish
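
[Editor's sketch] The layout asked about above (checkpoints and savepoints on HDFS, state backend working files on local SSDs) might look roughly like this in flink-conf.yaml. This is a minimal sketch, not a confirmed setup from the thread: all paths are hypothetical placeholders, RocksDB is assumed as the state backend, and the key names should be checked against the configuration reference for the Flink version in use.

```yaml
# flink-conf.yaml (sketch; every path below is a hypothetical placeholder)

# Assumption: RocksDB state backend, which keeps its working state on local disk.
state.backend: rocksdb

# Checkpoint and savepoint data on a filesystem that the JM and all TMs can reach,
# so the checkpoint coordinator on the JM can clean up old checkpoints.
state.checkpoints.dir: hdfs:///flink/checkpoints
state.savepoints.dir: hdfs:///flink/savepoints

# RocksDB working files on the local SSD of each TM (not shared, not durable).
state.backend.rocksdb.localdir: /mnt/ssd/flink/rocksdb

# Optional, assumption: incremental checkpoints to reduce the data written to HDFS.
state.backend.incremental: true
```

With a layout like this, only the checkpoint and savepoint files go to HDFS, while reads and writes during normal processing stay on the local SSDs.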

On Monday, July 23, 2018, 7:45 AM, ashish pok <ashishpok@yahoo.com> wrote:

I did have the first point at the back of my mind. I was under the impression, though, that for checkpoints
cleanup would be done by the TMs, since the checkpoints are taken by the TMs.
So for a standalone cluster with its own ZooKeeper for JM high availability, is a NAS a must-have?
We were going to go with local checkpoints plus access to remote HDFS for savepoints.
That sounds like a bad idea now. Unfortunately we can’t run on YARN, and a NAS
is also a no-no in one of our datacenters: there is a mountain of security compliance to
climb before we will be in production if we need to go that route.
Thanks, Ashish

On Monday, July 23, 2018, 5:10 AM, Stefan Richter <s.richter@data-artisans.com> wrote:


I am wondering how this can even work properly if you are using a local fs for checkpoints
instead of a distributed fs. First, what happens on node failures, if the SSD becomes unavailable,
if a task gets scheduled to a different machine and can no longer access the disk with
the corresponding state data, or if you want to scale out? Second, the same problem is
what you are observing with the job manager: how could the checkpoint coordinator, which runs
on the JM, access a file on a local FS on a different node to clean up the checkpoint data?
The purpose of using a distributed fs here is that all TMs and the JM can access the checkpoint


> On 22.07.2018 at 19:03, Ashish Pokharel <ashishpok@yahoo.com> wrote:
> All,
> We recently moved our checkpoint directory from HDFS to local SSDs mounted on the Data Nodes
> (we were starting to see performance impacts on checkpoints etc. as complex ML apps were spinning
> up more and more in YARN). This worked great, except that when jobs are
> canceled, or canceled with savepoint, the local data is not cleaned up. On HDFS, the checkpoint
> directories were cleaned up on cancel and on cancel with savepoint, as far as I can remember.
> I am wondering if it is a permissions issue. The local disks have RWX permissions for both the yarn
> and flink headless users (the flink headless user submits the apps to YARN using our CICD pipeline).
>
> Appreciate any pointers on this.
> Thanks, Ashish
