flink-dev mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: Could not build up connection to JobManager
Date Wed, 25 Feb 2015 19:15:38 GMT
Okay, the problem seems to be that even though both the client and the
jobmanager use "localhost" as the host name, they resolve this to different
IP addresses: in one case 127.0.0.1, in the other case 10.216.177.146.

Also, the 127.0.0.1 address apparently cannot communicate with
10.216.177.146.
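
One quick way to see what "localhost" and the machine's own host name resolve
to is a small lookup script. This is only a sketch in Python, not part of
Flink; resolve_all is a made-up helper name, and the JVM may resolve names
differently from the OS resolver used here, so treat the output as a hint
rather than proof.

```python
import socket


def resolve_all(host):
    """Return the sorted, de-duplicated IP addresses the OS resolver gives for host."""
    infos = socket.getaddrinfo(host, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Compare the loopback name with the machine's own host name.
    for name in ("localhost", socket.gethostname()):
        try:
            print(name, "->", resolve_all(name))
        except socket.gaierror as err:
            print(name, "-> lookup failed:", err)
```

If the two names give different addresses, that would match the 127.0.0.1
versus 10.216.177.146 split described above.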

Can you help us debug this by checking the following:

 - Can you try and set "jobmanager.rpc.address" to 127.0.0.1 and see if
that solves it?
 - Can you try and set "jobmanager.rpc.address" to the other address
(10.216.177.146 or so) and see if that solves it?
 - Can you do "start-cluster.sh", rather than "start-local.sh" and see
whether the webfrontend displays that the TaskManager connects?
 - As a hard core test: Can you bring up the jobmanager, check where it
connects (10.216.192.98:6123 or so) and see whether the port is reachable?
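
For the last check, a plain TCP connect is usually enough to tell whether the
port is reachable at all. A minimal sketch in Python, not a Flink tool:
port_reachable is a hypothetical helper, and the addresses are the ones quoted
in this thread, so substitute whatever your JobManager log reports. A
successful connect only shows that the port is open, not that the actor system
behind it accepts the client.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


if __name__ == "__main__":
    # Addresses taken from this thread (assumption: 6123 is the JobManager RPC port).
    for host in ("127.0.0.1", "10.216.177.146"):
        state = "reachable" if port_reachable(host, 6123) else "NOT reachable"
        print(f"{host}:6123 is {state}")
```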

We have recently updated how the Akka URLs are built, to work around a
limitation in Akka. It seems that did not yet fully solve the issue.
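
For reference, the URLs in question look like the one in the stack trace
quoted below: akka.tcp://flink@10.216.177.146:6123/user/jobmanager. The
following sketch (Python, illustrative only; parse_akka_url and same_endpoint
are made-up helper names, not Flink or Akka APIs) pulls such a URL apart so
the host it names can be compared with the address the client uses. It
assumes, as the symptoms here suggest, that both sides have to name the same
address for the connection to work.

```python
from urllib.parse import urlsplit


def parse_akka_url(url):
    """Split an Akka remote URL into (actor system, host, port, actor path)."""
    parts = urlsplit(url)
    return parts.username, parts.hostname, parts.port, parts.path


def same_endpoint(url, host, port):
    """True only if the URL names exactly this host string and port (no DNS lookup)."""
    _, url_host, url_port, _ = parse_akka_url(url)
    return url_host == host and url_port == port


if __name__ == "__main__":
    url = "akka.tcp://flink@10.216.177.146:6123/user/jobmanager"
    print(parse_akka_url(url))
    # The client side in this thread resolved "localhost" to 127.0.0.1.
    print(same_endpoint(url, "127.0.0.1", 6123))
```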

Thanks for helping us debug this, it is not the easiest immigration
experience, but the outcome is probably extremely valuable for the project
:-)

Greetings,
Stephan


On Wed, Feb 25, 2015 at 4:03 PM, Dulaj Viduranga <vidura.me@icloud.com>
wrote:

> Hi,
> Sorry for the delay in replying on this issue.
> The jobmanager.rpc.address is already set to “localhost” in conf.yaml.
> This can’t be the issue, because the job manager web interface, which also
> runs on localhost, works fine.
>
>  bin/flink run <jar> doesn’t seem to work either. Let me send you my
> command and the result in terminal.
>
> bin/flink run
> /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar
> /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/hamlet.txt
> $FLINK_DIRECTORY/count
>
> 20:32:16,442 WARN  org.apache.hadoop.util.NativeCodeLoader
>        - Unable to load native-hadoop library for your platform... using
> builtin-java classes where applicable
> org.apache.flink.client.program.ProgramInvocationException: Could not
> build up connection to JobManager.
>         at org.apache.flink.client.program.Client.run(Client.java:327)
>         at org.apache.flink.client.program.Client.run(Client.java:306)
>         at org.apache.flink.client.program.Client.run(Client.java:300)
>         at
> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55)
>         at
> org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:82)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:483)
>         at
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437)
>         at
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353)
>         at org.apache.flink.client.program.Client.run(Client.java:250)
>         at
> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:371)
>         at org.apache.flink.client.CliFrontend.run(CliFrontend.java:344)
>         at
> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1087)
>         at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1114)
> Caused by: java.io.IOException: JobManager at akka.tcp://
> flink@10.216.177.146:6123/user/jobmanager not reachable. Please make sure
> that the JobManager is running and its port is reachable.
>         at
> org.apache.flink.runtime.jobmanager.JobManager$.getJobManagerRemoteReference(JobManager.scala:897)
>         at
> org.apache.flink.runtime.client.JobClient$.createJobClient(JobClient.scala:151)
>         at
> org.apache.flink.runtime.client.JobClient$.createJobClientFromConfig(JobClient.scala:142)
>         at
> org.apache.flink.runtime.client.JobClient$.startActorSystemAndActor(JobClient.scala:125)
>         at
> org.apache.flink.runtime.client.JobClient.startActorSystemAndActor(JobClient.scala)
>         at org.apache.flink.client.program.Client.run(Client.java:322)
>         ... 15 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [10000 milliseconds]
>         at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>         at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>         at
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>         at
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:107)
>         at
> org.apache.flink.runtime.jobmanager.JobManager$.getJobManagerRemoteReference(JobManager.scala:893)
>         ... 20 more
>
> The exception above occurred while trying to run your command.
>
>
> > On Feb 25, 2015, at 1:29 AM, Stephan Ewen <sewen@apache.org> wrote:
> >
> > BTW: Does it still work if you enter "localhost" for
> > "jobmanager.rpc.address" in your flink-conf.yaml?
> >
> > On Tue, Feb 24, 2015 at 7:50 PM, Stephan Ewen <sewen@apache.org> wrote:
> >
> >> Hi!
> >>
> >> I think that this is a problem in the current master (probably in there
> >> since a few days ago). I am fixing it...
> >>
> >> Thanks for reporting it!
> >>
> >> Stephan
> >>
> >>
> >> On Tue, Feb 24, 2015 at 6:52 PM, Stephan Ewen <sewen@apache.org> wrote:
> >>
> >>> Hi Dulaj!
> >>>
> >>> The log suggests that the JobManager binds itself to the IP
> >>> address 10.216.192.98 and the WebClient runs at 127.0.0.1
> >>>
> >>> The 127.0.0.1 actor system cannot connect to the 10.216.192.98.
> >>>
> >>> Let me verify whether this is a quirk of your particular setup, or a
> >>> bug recently introduced in the 0.9-SNAPSHOT.
> >>>
> >>> Does the command line work for you? ("bin/flink run <jar>")
> >>>
> >>> taskmanager.numberOfTaskSlots: -1 is also okay, this will mean that
> >>> the default of '1' is used.
> >>>
> >>> Greetings,
> >>> Stephan
> >>>
> >>>
> >>>
> >>> On Tue, Feb 24, 2015 at 5:18 PM, Dulaj Viduranga <vidura.me@icloud.com
> >
> >>> wrote:
> >>>
> >>>> Is taskmanager.numberOfTaskSlots: -1 normal?
> >>>>
> >>>>> On Feb 24, 2015, at 9:44 PM, Robert Metzger <rmetzger@apache.org>
> >>>> wrote:
> >>>>>
> >>>>> Hi,
> >>>>> I could not find the logfiles attached to your mails. I think the
> >>>>> mailinglists are not accepting attachments.
> >>>>> Can you put the logs on gist.github.com?
> >>>>>
> >>>>> The configuration values are documented here:
> >>>>> http://flink.apache.org/docs/0.8/config.html
> >>>>> For the webclient's port, it's called webclient.port
> >>>>>
> >>>>> On Tue, Feb 24, 2015 at 5:04 PM, Dulaj Viduranga <
> vidura.me@icloud.com
> >>>>>
> >>>>> wrote:
> >>>>>
> >>>>>> I tried to kill the job manager manually in the terminal and
> >>>>>> start it again but no luck. Also could you tell me if it’s possible
> >>>>>> to change the webclient’s port (8080)?
> >>>>>>
> >>>>>>> On Feb 24, 2015, at 1:41 PM, Stephan Ewen <sewen@apache.org>
> wrote:
> >>>>>>>
> >>>>>>> Hey Dulaj!
> >>>>>>>
> >>>>>>> As a contributor, I would go against the latest version,
> >>>>>>> which is 0.9-SNAPSHOT.
> >>>>>>>
> >>>>>>> It may be in your case that the JobManager actor is down,
> >>>>>>> but the process still lingers. (BTW: I have a patch pending that
> >>>>>>> makes sure the process disappears when the actor goes down).
> >>>>>>>
> >>>>>>> Could you have a look at the log
> >>>>>>> "flink-<user>-jobmanager-<host>-.log" and
> >>>>>>> see if there are any errors logged?
> >>>>>>>
> >>>>>>> Greetings,
> >>>>>>> Stephan
> >>>>>>> Am 24.02.2015 06:29 schrieb "Dulaj Viduranga" <
> vidura.me@icloud.com
> >>>>> :
> >>>>>>>
> >>>>>>>> The JobManager seems to run fine. I don't know. When
> >>>>>>>> I tried to run start-local.sh again, it shows the PID of the
> >>>>>>>> running JobManager and also :8081 runs fine. I want to contribute
> >>>>>>>> to the project and I could get a little boost if I could see the
> >>>>>>>> capabilities of FLINK. :)
> >>>>>>>> Will it be OK to use 0.8.1 as a developer?
> >>>>>>>>
> >>>>>>>> On Feb 24, 2015, at 04:15 AM, Stephan Ewen <sewen@apache.org>
> >>>> wrote:
> >>>>>>>>
> >>>>>>>> Hi Dulaj,
> >>>>>>>>
> >>>>>>>> That error message indicates that the JobManager is
> >>>>>>>> not running. Are you sure that the JobManager runs properly?
> >>>>>>>> Anything in the JobManager logs?
> >>>>>>>>
> >>>>>>>> BTW: The 0.9 branch is under heavy development / changes.
> >>>>>>>> That is why it may behave a bit differently on different days
> >>>>>>>> right now. I would recommend using the 0.8.1 release for a
> >>>>>>>> stable experience.
> >>>>>>>>
> >>>>>>>> Greetings,
> >>>>>>>> Stephan
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mon, Feb 23, 2015 at 7:39 PM, Robert Metzger <
> >>>> rmetzger@apache.org>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Thank you for the quick reply.
> >>>>>>>>
> >>>>>>>> The log you've sent is from the webclient. Can you also
> >>>>>>>> send the log of the JobManager?
> >>>>>>>>
> >>>>>>>> On Mon, Feb 23, 2015 at 7:28 PM, Dulaj Viduranga <
> >>>> vidura.me@icloud.com>
> >>>>>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Yes. It seems it is not a problem with the arguments. I tried
> >>>>>>>>> two days but different error occurs. It seems the web client
> >>>>>>>>> can’t connect to the job manager although it is running
> >>>>>>>>
> >>>>>>>>> Right now, I can’t even get the webclient to run.
> >>>>>>>>
> >>>>>>>> ./bin/start-webclient.sh
> >>>>>>>>
> >>>>>>>>> executes fine but I cannot connect to localhost:8080
> >>>>>>>>> (even with telnet or curl)
> >>>>>>>>
> >>>>>>>>> Here is the log for jobManager
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>> 23:22:31,933 INFO org.apache.flink.client.web.WebInterfaceServer
> >>>>>>>>
> >>>>>>>>> - Setting up web frontend server, using web-root
> >>>>>>>>> directory
> >>>>>>>>> 'jar:file:/Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/lib/flink-clients-0.9-SNAPSHOT.jar!/web-docs'.
> >>>>>>>>
> >>>>>>>>> 23:22:31,934 INFO org.apache.flink.client.web.WebInterfaceServer
> >>>>>>>>
> >>>>>>>>> - Web frontend server will store temporary files
> >>>>>>>>> in '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T', uploaded jobs in
> >>>>>>>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T/webclient-jobs',
> >>>>>>>>> plan-json-dumps in
> >>>>>>>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T/webclient-plans'.
> >>>>>>>>
> >>>>>>>>> 23:22:31,934 INFO org.apache.flink.client.web.WebInterfaceServer
> >>>>>>>>
> >>>>>>>>> - Web-frontend will submit jobs to nephele job-manager
> >>>>>>>>> on localhost, port 6123.
> >>>>>>>>
> >>>>>>>>> 23:22:32,580 INFO akka.event.slf4j.Slf4jLogger
> >>>>>>>>
> >>>>>>>>> - Slf4jLogger started
> >>>>>>>>
> >>>>>>>>> 23:22:32,625 INFO Remoting
> >>>>>>>>
> >>>>>>>>> - Starting remoting
> >>>>>>>>
> >>>>>>>>> 23:22:32,838 INFO Remoting
> >>>>>>>>
> >>>>>>>>> - Remoting started; listening on addresses
> >>>>>>>>> :[akka.tcp://JobsInfoServletActorSystem@127.0.0.1:51517]
> >>>>>>>>
> >>>>>>>>> 23:23:48,119 WARN Remoting
> >>>>>>>>
> >>>>>>>>> - Tried to associate with unreachable remote address
> >>>>>>>>> [akka.tcp://flink@10.218.98.169:6123]. Address is now gated for
> >>>>>>>>> 5000 ms, all messages to this address will be delivered to dead
> >>>>>>>>> letters. Reason: Operation timed out: /10.218.98.169:6123
> >>>>>>>>
> >>>>>>>>> 23:23:48,124 ERROR org.apache.flink.client.WebFrontend
> >>>>>>>>
> >>>>>>>>> - Unexpected exception: Could not find job manager
> >>>>>>>>> at specified address
> >>>>>>>>> akka.tcp://flink@10.218.98.169:6123/user/jobmanager.
> >>>>>>>>
> >>>>>>>>> java.lang.RuntimeException: Could not find job manager
> >>>>>>>>> at specified address
> >>>>>>>>> akka.tcp://flink@10.218.98.169:6123/user/jobmanager.
> >>>>>>>>
> >>>>>>>>> at
> >>>>>>>>> org.apache.flink.client.web.JobsInfoServlet.<init>(JobsInfoServlet.java:82)
> >>>>>>>>> at
> >>>>>>>>> org.apache.flink.client.web.WebInterfaceServer.<init>(WebInterfaceServer.java:158)
> >>>>>>>>
> >>>>>>>>> at org.apache.flink.client.WebFrontend.main(WebFrontend.java:74)
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>>> On Feb 23, 2015, at 11:46 PM, Robert Metzger
> >>>>>>>>>> <rmetzger@apache.org>
> >>>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>>>> you said in the other email thread that the
> >>>>>>>>>> error only occurs for Wordcount, not for Kmeans.
> >>>>>>>>
> >>>>>>>>>> Can you copy me the commands for both examples?
> >>>>>>>>
> >>>>>>>>>> I can not really believe that there is a difference
> >>>>>>>>>> between the two jobs.
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>>> Can you also send us the contents of the jobmanager
> >>>>>>>>>> log file?
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>
> >>>>>>>>>> Robert
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>>> On Mon, Feb 23, 2015 at 6:04 PM, Dulaj Viduranga
> >>>>>>>>>> <vidura.me@icloud.com>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>>>> I’m getting "Could not build up connection
> >>>>>>>>>>> to JobManager." when I tried to run the WordCount example. Can
> >>>>>>>>>>> anyone help?
> >>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>>>>>>> Dulaj
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>
>
>
