beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ben Sidhom (JIRA)" <>
Subject [jira] [Commented] (BEAM-5110) Reconile Flink JVM singleton management with deployment
Date Tue, 14 Aug 2018 23:37:00 GMT


Ben Sidhom commented on BEAM-5110:

I meant to mention in the previous comment: as of now, this is not a correctness issue but
rather a performance/resources issue. We generally want to allow SDK environment reuse to
amortize startup costs (and save memory).

The fact that the Python SDK does _not_ handle multiple control clients over a single channel
is arguably an SDK harness bug or a configuration issue (e.g., we may need to allow the environment
factory/manager to spawn new environments if requested). This is something that should be
fleshed out in the fn-api contract and is orthogonal to JVM instance management in the portable
Flink runner.

> Reconile Flink JVM singleton management with deployment
> -------------------------------------------------------
>                 Key: BEAM-5110
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-flink
>            Reporter: Ben Sidhom
>            Assignee: Ben Sidhom
>            Priority: Major
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
> [~angoenka] noticed through debugging that multiple instances of BatchFlinkExecutableStageContext.BatchFactory
are loaded for a given job when executing in standalone cluster mode. This context factory
is responsible for maintaining singleton state across a TaskManager (JVM) in order to share
SDK Environments across workers in a given job. The multiple-loading breaks singleton semantics
and results in an indeterminate number of Environments being created.
> It turns out that the [Flink classloading mechanism|]
is determined by deployment mode. Note that "user code" as referenced by this link is actually
the Flink job server jar. Actual end-user code lives inside of the SDK Environment and uploaded
> In order to maintain singletons without resorting to IPC (for example, using file locks
and/or additional gRPC servers), we need to force non-dynamic classloading. For example,
this happens when jobs are submitted to YARN for one-off deployments via `flink run`. However,
connecting to an existing (Flink standalone) deployment results in dynamic classloading.
> We should investigate this behavior and either document (and attempt to enforce) deployment
modes that are consistent with our requirements, or (if possible) create a custom classloader
that enforces singleton loading.

This message was sent by Atlassian JIRA

View raw message