flink-user mailing list archives

From Saliya Ekanayake <esal...@gmail.com>
Subject Flink is Unstable when TM > 1
Date Fri, 08 Jul 2016 03:57:31 GMT

I've been trying to run the provided KMeans example on a 16-node cluster. I
am testing with 2 Task Managers (TMs) per node because each node has 2
sockets (CPUs). Each socket has 12 cores, so I've set the number of slots
per TM to 12. The total parallelism is 384 (12 slots x 2 TMs x 16 nodes).
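
For reference, the parallelism arithmetic above can be sketched as follows. This is only an illustration of how I arrived at 384; the function and parameter names are hypothetical and are not part of Flink.

```python
# Minimal sketch of the slot arithmetic described above.
# All names here are hypothetical helpers, not Flink APIs.

def total_parallelism(nodes: int, tms_per_node: int, slots_per_tm: int) -> int:
    """Return the total number of task slots across the cluster."""
    return nodes * tms_per_node * slots_per_tm


# 16 nodes, 2 TMs per node (one per socket), 12 slots per TM (one per core)
print(total_parallelism(nodes=16, tms_per_node=2, slots_per_tm=12))  # prints 384
```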

However, the Flink TMs keep failing from time to time, causing KMeans to
fail. The only explanation I could find in the logs is that the TMs
unregister from the Job Manager. I've also increased the Akka timeout to
1000s.

Any suggestions on this?

The data sizes I tried were 10k points, 250k points, and 1 million points.
The number of centers ranged from 100 to 1000. None of these runs completed.

Thank you,

Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
