flink-user mailing list archives

From Shannon Carey <sca...@expedia.com>
Subject API request to submit job takes over 1hr
Date Wed, 01 Jun 2016 17:54:03 GMT
Hi folks,

I have deployed a Flink cluster on top of YARN in an AWS EMR cluster in my test environment,
and everything is working fine. However, I am unable to submit jobs to the prod cluster.

Uploading the JAR containing a Flink job succeeds. However, the request to run the job (UI
makes API request to /jars/<jarname>/run?<params>) takes so long to complete that
the ELB finally returns a 504 GATEWAY_TIMEOUT response. This is the case even if the ELB timeout
is set to 1hr: the request returns 504 after 1hr. The request also appears to fail server-side,
since no job has ever shown up in the UI in any status (successful/failed/completed
or otherwise). Interestingly, shortly after the request is made, other requests from the UI
to the API sometimes (but not always) begin to take longer than usual, although
they do all eventually complete.
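
One way to tell the ELB timeout apart from the server-side delay would be to time the same
request directly against the Job Manager web endpoint. The following is only a minimal sketch
under assumptions: the helper name time_post is hypothetical, the request is assumed to be
a POST with an empty body, and the URL (host, port, and the <jarname> and <params> parts)
has to be supplied by the caller.

```python
import time
import urllib.error
import urllib.request


def time_post(url, timeout_s):
    """POST an empty body to `url`; return (outcome, elapsed_seconds).

    Assumption: the run request is a POST. The URL is whatever the UI calls,
    e.g. the /jars/<jarname>/run?<params> endpoint, addressed directly at the
    Job Manager instead of through the ELB.
    """
    start = time.monotonic()
    try:
        request = urllib.request.Request(url, data=b"", method="POST")
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            outcome = "HTTP %d" % response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        outcome = "HTTP %d" % err.code
    except (urllib.error.URLError, OSError) as err:
        # Connection refused, DNS failure, or client-side timeout.
        outcome = "error: %s" % err
    return outcome, time.monotonic() - start
```

If such a direct call also hangs until the client-side timeout, the delay would be in the
Job Manager rather than in the ELB.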

No interesting/suspicious log entries have been found. All YARN nodes appear healthy.

Does anyone have ideas about what the problem might be? Or ideas about troubleshooting steps
I should take?

Also, I was wondering if 1GB is a reasonable amount of memory to use for the Flink Job Manager?
It appears to be using only ~570MB but I am not sure if the Job Manager might be misbehaving
due to resource constraints. The prod cluster is currently composed of six c3.2xlarge EC2
instances. Task memory is set to 10496 MB, Job Manager memory is set to 1024 MB, and there are 8
slots set in the yarn-session.sh command. Are there any guidelines for memory allocation for
the Job Manager?
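
For reference, the figures above work out as follows. This is only a rough sketch of the
arithmetic, assuming the memory values are in MB and that the 8 slots belong to a single
Task Manager and share its memory evenly; the function names are made up for illustration.

```python
# Figures quoted in the message (assumed to be in MB).
TASK_MEMORY_MB = 10496
JOB_MANAGER_MEMORY_MB = 1024
JOB_MANAGER_OBSERVED_MB = 570  # approximate ("~570MB")
SLOTS = 8


def memory_per_slot_mb(task_memory_mb, slots):
    """Memory available per slot if slots share the Task Manager evenly."""
    return task_memory_mb / slots


def utilization(used_mb, total_mb):
    """Fraction of the configured memory that is currently in use."""
    return used_mb / total_mb


print(memory_per_slot_mb(TASK_MEMORY_MB, SLOTS))  # 1312.0
print(round(utilization(JOB_MANAGER_OBSERVED_MB, JOB_MANAGER_MEMORY_MB), 2))  # 0.56
```

So each slot would get roughly 1312 MB, and the Job Manager is using a bit over half of its
configured 1024 MB.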

Thanks very much!
Shannon Carey