flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-7540) Akka hostnames are not normalised consistently
Date Wed, 11 Oct 2017 23:27:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201187#comment-16201187 ]

ASF GitHub Bot commented on FLINK-7540:

GitHub user tillrohrmann opened a pull request:


    [FLINK-7540] Apply consistent hostname normalization

    ## What is the purpose of the change
    The hostname normalization is now applied when generating the remote akka config.
    That way it should be ensured that all ActorSystems are bound to a normalized hostname.
    ## Brief change log
    - Add hostname normalization to `AkkaUtils#getAkkaConfig`
    - Replace manual ActorSystem instantiation with `BootstrapTools#startActorSystem`
    ## Verifying this change
    - Added `AkkaUtilsTest#getAkkaConfig`
    ## Does this pull request potentially affect one of the following parts:
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing,
Yarn/Mesos, ZooKeeper: (yes) It affects how `ActorSystem` instances are created.
    ## Documentation
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixHostnameNormalization

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4812
commit 00876ead7a4a7492d643f6cba3e784044c54669e
Author: Till Rohrmann <trohrmann@apache.org>
Date:   2017-10-11T23:17:23Z

    [FLINK-7540] Apply consistent hostname normalization
    The hostname normalization is now applied when generating the remote akka config.
    That way it should be ensured that all ActorSystems are bound to a normalized hostname.
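
The idea described in the commit message can be illustrated with a minimal sketch. This is not the code from the pull request: the class `HostnameNormalizationSketch` and its methods are hypothetical, and the normalization only mirrors the trim-and-lowercase logic quoted in the issue below (IP address handling is left out). The point is that the bind address and the lookup address agree when both go through the same normalization step.

```java
import java.net.InetAddress;
import java.util.Locale;

public class HostnameNormalizationSketch {

    // Hypothetical helper: trims and lowercases the host, falling back to the
    // loopback address for null, like the snippet quoted in the issue.
    static String normalizeHost(String host) {
        if (host == null) {
            return InetAddress.getLoopbackAddress().getHostAddress();
        }
        return host.trim().toLowerCase(Locale.ROOT);
    }

    // Hypothetical: the address an actor system would be bound to.
    static String bindAddress(String host, int port) {
        return "akka.tcp://flink@" + normalizeHost(host) + ":" + port;
    }

    // Hypothetical: the address another component would use to look up the JobManager.
    static String lookupAddress(String host, int port) {
        return bindAddress(host, port) + "/user/jobmanager";
    }

    public static void main(String[] args) {
        // Hostname as reported by the machine versus an already lowercased copy.
        String bound = bindAddress("DSJ-signal-4T-248", 65082);
        String lookup = lookupAddress("dsj-signal-4t-248", 65082);

        System.out.println(bound);
        // Both sides were normalized, so the lookup points at the bound system.
        System.out.println(lookup.startsWith(bound));
    }
}
```

Normalizing in one shared place, rather than at each call site, is what keeps the two addresses from drifting apart.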


> Akka hostnames are not normalised consistently
> ----------------------------------------------
>                 Key: FLINK-7540
>                 URL: https://issues.apache.org/jira/browse/FLINK-7540
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, YARN
>    Affects Versions: 1.3.1, 1.4.0, 1.3.2
>            Reporter: Tong Yan Ou
>            Assignee: Till Rohrmann
>            Priority: Critical
>              Labels: patch
>             Fix For: 1.3.3
>   Original Estimate: 336h
>  Remaining Estimate: 336h
> In {{NetUtils.unresolvedHostToNormalizedString()}} we lowercase hostnames, but Akka seems
to preserve the uppercase/lowercase distinction when starting the Actor. This leads to problems
because other parts (for example {{JobManagerRetriever}}) cannot find the actor, which leaves
the cluster nonfunctional.
> h1. Original Issue Text
> Hostnames in my Hadoop cluster look like these: “DSJ-RTB-4T-177”, “DSJ-signal-900G-71”
> When using the following command:
> ./bin/flink run -m yarn-cluster -yn 1 -yqu xl_trip -yjm 1024 ~/flink-1.3.1/examples/batch/WordCount.jar
--input /user/all_trip_dev/test/testcount.txt --output /user/all_trip_dev/test/result  
> Or
> ./bin/yarn-session.sh -d -jm 6144  -tm 12288 -qu xl_trip -s 24 -n 5 -nm "flink-YarnSession-jm6144-tm12288-s24-n5-xl_trip"
> The command line interface reports exceptions:
> java.lang.RuntimeException: Unable to get ClusterClient status from Application Client
> at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:243)
> …
> Caused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager.
Please check that the JobManager is running.
> h4. Then the job fails; starting the yarn-session fails in the same way.
> The application log shows these exceptions:
> 2017-08-10 17:36:10,334 WARN  org.apache.flink.runtime.webmonitor.JobManagerRetriever
      - Failed to retrieve leader gateway and port.
> akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/),
> …
> 2017-08-10 17:36:10,837 ERROR org.apache.flink.yarn.YarnFlinkResourceManager        
       - Resource manager could not register at JobManager
> akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/),
Path(/user/jobmanager)]] after [10000 ms]
> I also found differences in the actor system addresses:
> 2017-08-10 17:35:56,791 INFO  org.apache.flink.yarn.YarnJobManager                  
       - Starting JobManager at akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager.
> 2017-08-10 17:35:56,880 INFO  org.apache.flink.yarn.YarnJobManager                  
       - JobManager akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager was granted leadership
with leader session ID Some(00000000-0000-0000-0000-000000000000).
> 2017-08-10 17:36:00,312 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor 
       - Web frontend listening at 0:0:0:0:0:0:0:0:54921
> 2017-08-10 17:36:00,312 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor 
       - Starting with JobManager akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager
on port 54921
> 2017-08-10 17:36:00,313 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever
      - New leader reachable under akka.tcp://flink@dsj-signal-4t-248:65082/user/jobmanager:00000000-0000-0000-0000-000000000000.
> The JobManager is at “akka.tcp://flink@DSJ-signal-4T-248:65082”, while the JobManagerRetriever
looks for “akka.tcp://flink@dsj-signal-4t-248:65082”.
> The hostname used by the JobManagerRetriever is lowercase.
> I then read the source code.
> In class NetUtils, the unresolvedHostToNormalizedString(String host) method at line 127:
> 	public static String unresolvedHostToNormalizedString(String host) {
> 		// Return loopback interface address if host is null
> 		// This represents the behavior of {@code InetAddress.getByName } and RFC 3330
> 		if (host == null) {
> 			host = InetAddress.getLoopbackAddress().getHostAddress();
> 		} else {
> 			host = host.trim().toLowerCase();
> 		}
> 		...
> 	}
> It converts the hostname to lowercase.
> Therefore, the JobManagerRetriever cannot find the JobManager's actor system.
> I then removed the call to the toLowerCase() method in the source code.
> After that, I could submit a job in yarn-cluster mode and start a yarn-session.
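
The mismatch reported above can be reproduced in isolation with a small sketch. It uses the two addresses from the logs and assumes, as the `ActorNotFound` error suggests, that the lookup compares the host part case-sensitively. The class and method names are hypothetical and do not model Akka's actual resolution logic.

```java
public class AddressCaseMismatchSketch {

    // Stand-in for a case-sensitive address match (an assumption about the lookup).
    static boolean matchesExactly(String a, String b) {
        return a.equals(b);
    }

    // Case-insensitive comparison, shown only for contrast.
    static boolean matchesIgnoringCase(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        // Address the JobManager was started at, taken from the log above.
        String started = "akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager";
        // Address the JobManagerRetriever used, taken from the log above.
        String lookedUp = "akka.tcp://flink@dsj-signal-4t-248:65082/user/jobmanager";

        System.out.println(matchesExactly(started, lookedUp));      // prints "false"
        System.out.println(matchesIgnoringCase(started, lookedUp)); // prints "true"
    }
}
```

The two strings differ only in case, so any component that lowercases the hostname on its own ends up with an address the other side does not recognize.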

This message was sent by Atlassian JIRA
