flink-dev mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: Diagnosing TaskManager disappearance
Date Thu, 29 Oct 2015 17:00:57 GMT
Hi Greg!

Interesting... When you say the TaskManagers are dropping, are the
TaskManager processes crashing, or are they losing connection to the
JobManager?

Greetings,
Stephan


On Thu, Oct 29, 2015 at 9:56 AM, Greg Hogan <code@greghogan.com> wrote:

> I recently discovered that AWS exposes NUMA on its largest instance
> types. An example from a c4.8xlarge:
>
> $ numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 18 19 20 21 22 23 24 25 26
> node 0 size: 29813 MB
> node 0 free: 24537 MB
> node 1 cpus: 9 10 11 12 13 14 15 16 17 27 28 29 30 31 32 33 34 35
> node 1 size: 30574 MB
> node 1 free: 22757 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> (In this distance matrix, 10 means local access and 20 remote access
> across the interconnect, nominally twice the cost.)
>
> I discovered yesterday that Flink ran ~20-30% faster on large datasets
> when running two NUMA-constrained TaskManagers per node (the
> JobManager's node ran a single TaskManager). Each TaskManager was
> given half the resources of the original single-TaskManager setup.
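>
> For reference, a minimal sketch of what "half the resources" means in
> flink-conf.yaml (the key names are real, but the numbers below are
> illustrative assumptions, not my exact values):
>
> # before: one TaskManager per node
> #   taskmanager.heap.mb: 49152
> #   taskmanager.numberOfTaskSlots: 36
> # after: two NUMA-bound TaskManagers per node, each with half
> taskmanager.heap.mb: 24576
> taskmanager.numberOfTaskSlots: 18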
>
> The changes from the tail of /bin/taskmanager.sh:
>
> -"${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> +numactl --membind=0 --cpunodebind=0 "${FLINK_BIN_DIR}"/flink-daemon.sh
> $STARTSTOP taskmanager "${args[@]}"
> +numactl --membind=1 --cpunodebind=1 "${FLINK_BIN_DIR}"/flink-daemon.sh
> $STARTSTOP taskmanager "${args[@]}"
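>
> As a quick sanity check after starting both daemons, the CPU pinning
> can be verified with taskset (the pgrep pattern is an assumption about
> the TaskManager's JVM command line, adjust to match your setup):
>
> # print the CPU affinity of each TaskManager JVM; one should report
> # node 0's CPUs (0-8,18-26) and the other node 1's (9-17,27-35)
> $ pgrep -f TaskManager | xargs -n1 taskset -cp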
>
> After reverting this change, the system is stable again. I had not
> experienced issues using numactl when running 16 nodes.
>
> Greg
>
