flink-user mailing list archives

From "LINZ, Arnaud" <AL...@bouyguestelecom.fr>
Subject RE: HA Mode and standalone containers compatibility ?
Date Thu, 03 Dec 2015 10:47:44 GMT
Oops... false alarm.

In fact, it does start another container, but that container exits immediately because the
job is not submitted to it but to the streaming one.

Log details: 

Command = 
#  JVM_ARGS =  -DCluster.Parallelisme=150  -Drecovery.mode=standalone
/usr/lib/flink/bin/flink run -m yarn-cluster -yn 48 -ytm 5120 -yqu batch1 -ys 4 --class com.bouygtel.kubera.main.segstage.MainGeoSegStage
/home/voyager/KBR/GOS/lib/KUBERA-GEO-SOURCE-0.0.1-SNAPSHOT-allinone.jar  -j /home/voyager/KBR/GOS/log
-c /home/voyager/KBR/GOS/cfg/KBR_GOS_Config.cfg 
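
(For reference: -yn 48 requests 48 TaskManager containers, -ytm 5120 gives each TaskManager 5120 MB, -yqu batch1 selects the YARN queue, and -ys 4 configures 4 task slots per TaskManager, i.e. 192 slots in total.)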

Log = 
Found YARN properties file /tmp/.yarn-properties-voyager
YARN properties set default parallelism to 24
Using JobManager address from YARN properties bt1shli3.bpa.bouyguestelecom.fr/172.21.125.28:36700
YARN cluster mode detected. Switching Log4j output to console
11:39:18,192 INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl     - Timeline
service address: http://h1r1dn02.bpa.bouyguestelecom.fr:8188/ws/v1/timeline/
11:39:18,349 INFO  org.apache.hadoop.yarn.client.RMProxy                         - Connecting
to ResourceManager at h1r1nn01.bpa.bouyguestelecom.fr/172.21.125.3:8050
11:39:18,504 INFO  org.apache.flink.client.FlinkYarnSessionCli                   - No path
for the flink jar passed. Using the location of class org.apache.flink.yarn.FlinkYarnClient
to locate the jar
11:39:18,513 INFO  org.apache.flink.yarn.FlinkYarnClient                         - Using values:
11:39:18,515 INFO  org.apache.flink.yarn.FlinkYarnClient                         -   TaskManager
count = 48
11:39:18,515 INFO  org.apache.flink.yarn.FlinkYarnClient                         -   JobManager
memory = 1024
11:39:18,515 INFO  org.apache.flink.yarn.FlinkYarnClient                         -   TaskManager
memory = 5120
11:39:18,641 WARN  org.apache.flink.yarn.FlinkYarnClient                         - The JobManager
or TaskManager memory is below the smallest possible YARN Container size. The value of 'yarn.scheduler.minimum-allocation-mb'
is '2048'. Please increase the memory size.YARN will allocate the smaller containers but the
scheduler will account for the minimum-allocation-mb, maybe not all instances you requested
will start.
11:39:19,102 INFO  org.apache.flink.yarn.Utils                                   - Copying
from file:/usr/lib/flink/lib/flink-dist_2.11-0.10.0.jar to hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/voyager/.flink/application_1449127732314_0046/flink-dist_2.11-0.10.0.jar
11:39:19,653 INFO  org.apache.flink.yarn.Utils                                   - Copying
from /usr/lib/flink/conf/flink-conf.yaml to hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/voyager/.flink/application_1449127732314_0046/flink-conf.yaml
11:39:19,667 INFO  org.apache.flink.yarn.Utils                                   - Copying
from file:/usr/lib/flink/conf/logback.xml to hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/voyager/.flink/application_1449127732314_0046/logback.xml
11:39:19,679 INFO  org.apache.flink.yarn.Utils                                   - Copying
from file:/usr/lib/flink/conf/log4j.properties to hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/voyager/.flink/application_1449127732314_0046/log4j.properties
11:39:19,698 INFO  org.apache.flink.yarn.FlinkYarnClient                         - Submitting
application master application_1449127732314_0046
11:39:19,723 INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl         - Submitted
application application_1449127732314_0046
11:39:19,723 INFO  org.apache.flink.yarn.FlinkYarnClient                         - Waiting
for the cluster to be allocated
11:39:19,725 INFO  org.apache.flink.yarn.FlinkYarnClient                         - Deploying
cluster, current state ACCEPTED
11:39:20,727 INFO  org.apache.flink.yarn.FlinkYarnClient                         - Deploying
cluster, current state ACCEPTED
11:39:21,728 INFO  org.apache.flink.yarn.FlinkYarnClient                         - Deploying
cluster, current state ACCEPTED
11:39:22,730 INFO  org.apache.flink.yarn.FlinkYarnClient                         - Deploying
cluster, current state ACCEPTED
11:39:23,731 INFO  org.apache.flink.yarn.FlinkYarnClient                         - YARN application
has been deployed successfully.
11:39:23,734 INFO  org.apache.flink.yarn.FlinkYarnCluster                        - Start actor
system.
11:39:24,192 INFO  org.apache.flink.yarn.FlinkYarnCluster                        - Start application
client.
YARN cluster started
JobManager web interface address http://h1r1nn01.bpa.bouyguestelecom.fr:8088/proxy/application_1449127732314_0046/
Waiting until all TaskManagers have connected
11:39:24,202 INFO  org.apache.flink.yarn.ApplicationClient                       - Notification
about new leader address akka.tcp://flink@172.21.125.16:59907/user/jobmanager with session
ID null.
No status updates from the YARN cluster received so far. Waiting ...
11:39:24,206 INFO  org.apache.flink.yarn.ApplicationClient                       - Received
address of new leader akka.tcp://flink@172.21.125.16:59907/user/jobmanager with session ID
null.
11:39:24,206 INFO  org.apache.flink.yarn.ApplicationClient                       - Disconnect
from JobManager null.
11:39:24,210 INFO  org.apache.flink.yarn.ApplicationClient                       - Trying
to register at JobManager akka.tcp://flink@172.21.125.16:59907/user/jobmanager.
11:39:24,377 INFO  org.apache.flink.yarn.ApplicationClient                       - Successfully
registered at the JobManager Actor[akka.tcp://flink@172.21.125.16:59907/user/jobmanager#-801507205]
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (12/48)
TaskManager status (12/48)
TaskManager status (12/48)
TaskManager status (12/48)
TaskManager status (46/48)
TaskManager status (46/48)
TaskManager status (46/48)
TaskManager status (46/48)
All TaskManagers are connected
Using the parallelism provided by the remote cluster (192). To use another parallelism, set
it at the ./bin/flink client.
12/03/2015 11:39:55  Job execution switched to status RUNNING.
12/03/2015 11:39:55  CHAIN DataSource (at createInput(ExecutionEnvironment.java:508) (com.bouygtel.kuberasdk.hive.HiveHCatDAO$1))
-> FlatMap (FlatMap at readTable(HiveHCatDAO.java:120)) -> Map (Key Extractor 1)(1/150)
switched to SCHEDULED 
12/03/2015 11:39:55  CHAIN DataSource (at createInput(ExecutionEnvironment.java:508) (com.bouygtel.kuberasdk.hive.HiveHCatDAO$1))
-> FlatMap (FlatMap at readTable(HiveHCatDAO.java:120)) -> Map (Key Extractor 1)(1/150)
switched to DEPLOYING
=> The job starts

Then it crashes:

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free
slots available to run the job. You can decrease the operator parallelism or increase the
number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN
DataSource (at createInput(ExecutionEnvironment.java:508) (com.bouygtel.kuberasdk.hive.HiveHCatDAO$1))
-> FlatMap (FlatMap at readTable(HiveHCatDAO.java:120)) -> Map (Key Extractor 1) (5/150))
@ (unassigned) - [SCHEDULED] > with groupID < 7b9e554a93d3ea946d13d239a99bb6ae >
in sharing group < SlotSharingGroup [0c9285747d113d8dd85962602b674497, 9f30db9a30430385e1cd9d0f5010ed9e,
36b825566212059be3f888e3bbdf0d96, f95ba68c3916346efe497b937393eb49, e73522cce11e699022c285180fd1024d,
988b776310ef3d8a2a3875227008a30e, 7b9e554a93d3ea946d13d239a99bb6ae, 08af3a01b9cb49b76e6aedcd57d57788,
3f91660c6ab25f0f77d8e55d54397b01] >. Resources available to scheduler: Number of instances=6,
total number of slots=24, available slots=0

It states that I have only 24 slots on my 48-container cluster!
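
The arithmetic matches my long-lived streaming session, not the cluster that was just deployed (presumably the 6 instances seen by the scheduler are the streaming TaskManagers):

  Expected in the new cluster: 48 TaskManagers x 4 slots (-ys 4) = 192 slots (as reported above)
  Seen by the scheduler:        6 TaskManagers x 4 slots         =  24 slots

So the client deploys a 48-container cluster but submits the job to the old JobManager, whose address it presumably picked up from /tmp/.yarn-properties-voyager (see the first lines of the log).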




-----Original Message-----
From: LINZ, Arnaud
Sent: Thursday, 3 December 2015 11:26
To: user@flink.apache.org
Subject: RE: HA Mode and standalone containers compatibility ?

Hi,

The batch job does not need to be HA.
I stopped everything, cleaned the temp files, added -Drecovery.mode=standalone, and it seems
to work now!
Strange, but good enough for me for now.

Thanks,
Arnaud

-----Original Message-----
From: Ufuk Celebi [mailto:uce@apache.org]
Sent: Thursday, 3 December 2015 11:11
To: user@flink.apache.org
Subject: Re: HA Mode and standalone containers compatibility ?

Hey Arnaud,

thanks for reporting this. I think Till’s suggestion will help to debug this (checking whether
a second YARN application has been started)…

You don’t want to run the batch application in HA mode, correct?

It sounds like the batch job is submitted with the same config keys. Could you start the batch
job explicitly with -Drecovery.mode=standalone?
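
For example, something like this (a sketch: the property is passed as a JVM system property via JVM_ARGS; any equivalent way of setting a dynamic property for the client should work, and the jar name is a placeholder):

  JVM_ARGS="-Drecovery.mode=standalone" bin/flink run -m yarn-cluster <your-batch-job>.jar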

If you do want the batch job to be HA as well, you have to configure separate Zookeeper root
paths:

recovery.zookeeper.path.root: /flink-streaming-1 # for the streaming session

recovery.zookeeper.path.root: /flink-batch # for the batch session
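
On the command line that could look roughly like this (a sketch: only the -D properties matter here; the other flags and the jar name are illustrative):

  # long-lived streaming session
  bin/yarn-session.sh -n 6 -s 4 -Drecovery.mode=zookeeper -Drecovery.zookeeper.path.root=/flink-streaming-1

  # short-lived batch cluster, also HA, with its own root path
  JVM_ARGS="-Drecovery.mode=zookeeper -Drecovery.zookeeper.path.root=/flink-batch" bin/flink run -m yarn-cluster <your-batch-job>.jar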

– Ufuk

> On 03 Dec 2015, at 11:01, LINZ, Arnaud <ALINZ@bouyguestelecom.fr> wrote:
> 
> Yes, it does interfere; I do have additional task managers. My batch application shows up
in my streaming cluster's Flink GUI instead of creating its own container with its own GUI,
despite the -m yarn-cluster option.
>  
> From: Till Rohrmann [mailto:trohrmann@apache.org]
> Sent: Thursday, 3 December 2015 10:36
> To: user@flink.apache.org
> Subject: Re: HA Mode and standalone containers compatibility ?
>  
> Hi Arnaud,
>  
> as long as you don't have HA activated for your batch jobs, HA shouldn't have an influence
on the batch execution. If it interferes, then you should see additional task managers connected
to the streaming cluster when you execute the batch job. Could you check that? Furthermore,
could you check whether a second YARN application is actually started when you run the batch
jobs?
>  
> Cheers,
> Till
>  
> On Thu, Dec 3, 2015 at 9:57 AM, LINZ, Arnaud <ALINZ@bouyguestelecom.fr> wrote:
> Hello,
> 
>  
> 
> I have both streaming applications & batch applications. Since the memory needs are
not the same, I was using a long-lived container for my streaming apps and new short-lived
containers for hosting each batch execution.
> 
>  
> 
> For that, I submit streaming jobs with "flink run" and batch jobs with "flink run -m
yarn-cluster", roughly like this (jar names below are placeholders):
> 
>  
> 
> This was working fine until I turned ZooKeeper HA mode on for my streaming applications.
> 
> Even though I don't set it up in the Flink YAML configuration file but with -D options on
the yarn-session.sh command line, my batch jobs now try to run in the streaming container
and fail for lack of resources.
> 
>  
> 
> My HA options are (a full session command is sketched after the list):
> 
> -Dyarn.application-attempts=10 -Drecovery.mode=zookeeper
> -Drecovery.zookeeper.quorum=h1r1en01:2181
> -Drecovery.zookeeper.path.root=/flink  -Dstate.backend=filesystem 
> -Dstate.backend.fs.checkpointdir=hdfs:///tmp/flink/checkpoints
> -Drecovery.zookeeper.storageDir=hdfs:///tmp/flink/recovery/
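> 
> Put together, the session start looks roughly like this (a sketch: the -n/-s values are illustrative, the -D properties are the options above):
> 
>   bin/yarn-session.sh -n 6 -s 4 \
>     -Dyarn.application-attempts=10 -Drecovery.mode=zookeeper \
>     -Drecovery.zookeeper.quorum=h1r1en01:2181 -Drecovery.zookeeper.path.root=/flink \
>     -Dstate.backend=filesystem -Dstate.backend.fs.checkpointdir=hdfs:///tmp/flink/checkpoints \
>     -Drecovery.zookeeper.storageDir=hdfs:///tmp/flink/recovery/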
> 
>  
> 
> Am I missing something?
> 
>  
> 
> Best regards,
> 
> Arnaud
> 
>  
> 
> The integrity of this message cannot be guaranteed on the Internet. The company that
sent this message cannot therefore be held liable for its content nor attachments. Any unauthorized
use or dissemination is prohibited. If you are not the intended recipient of this message,
then please delete it and notify the sender.
