mesos-dev mailing list archives

From Marco Massenzio <ma...@mesosphere.io>
Subject Re: Not able to connect to mesos from different machine
Date Fri, 29 May 2015 06:48:29 GMT
Apologies in advance if you already know all this and are an expert on vbox
& networking - but maybe this either helps or at least may point you in the
right direction (hopefully!)

The problem most likely lies in the fact that your laptop (or whatever box
you're running vbox on) has a hostname that's not DNS-resolvable (and your
VMs' hostnames probably aren't either).
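
A quick way to check this is to look at what a name actually resolves to.
The sketch below is only an illustration (the function name
resolves_to_routable is made up for this example). It uses Python's standard
socket and ipaddress modules to report whether a hostname resolves at all,
and whether any resulting address is a loopback one such as 127.0.0.1 or
127.0.1.1.

```python
import ipaddress
import socket


def resolves_to_routable(hostname):
    """Return (ok, addresses) for a hostname.

    ok is True only if the name resolves and none of the addresses is a
    loopback address, which other machines could never connect back to.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The name does not resolve at all.
        return False, []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string. Deduplicate and sort.
    addrs = sorted({info[4][0] for info in infos})
    # Drop any IPv6 scope suffix ("%eth0") before parsing the address.
    loopback = any(
        ipaddress.ip_address(a.split("%")[0]).is_loopback for a in addrs
    )
    return bool(addrs) and not loopback, addrs
```

Calling resolves_to_routable(socket.gethostname()) on the laptop and on each
VM should show whether the name a process would advertise is usable from the
other machines.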

Further, by default, VBox attaches the VM's NICs in 'NAT' mode on a private
subnet, which means that you can 'net out' (e.g., ping google.com from the
VM) but not get in (e.g., run a server accessible from outside the VM).

The Mesos master and slaves need to be able to talk to each other,
bi-directionally; the lack of an inbound route is possibly what was causing
the issue in the first place.
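
To test each direction separately, a plain TCP connect is usually enough. This
is a minimal sketch, not a Mesos tool: can_connect is a hypothetical helper,
and the addresses and ports in the usage note are simply the ones that appear
in the logs quoted further down (5050 for the master, 5051 for the slave).

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable names.
        return False
```

For example, can_connect("192.168.33.10", 5050) run on the host checks the
path to the master, while running the same check from inside the master VM
against the address and port the scheduler advertises checks the way back.
Both have to succeed.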

Setting up NAT port forwarding for the VMs probably won't work either (you
won't know in advance which port the Slave will be listening on - I think!)

One option is to configure vbox's VMs to be on their own subnet (I forget
the exact terminology, as it's been almost a year since I fiddled with it;
I think it's the Host-Only option
<https://www.virtualbox.org/manual/ch06.html#network_hostonly>).
Essentially, vbox will create a subnet and act as a router; the host
machine will also have a virtual NIC in that subnet, so you'll be able to
route requests to/from the VMs.
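
Once the host has a NIC in that subnet, the scheduler still has to advertise
that address rather than loopback, which is what the LIBPROCESS_IP variable
mentioned earlier in the thread is for. The sketch below is one way to find
the right value, under assumptions: local_ip_towards and libprocess_env are
hypothetical helper names, and the /24 subnet in the usage note is a guess
based on the 192.168.33.x addresses in the quoted logs.

```python
import ipaddress
import os
import socket


def local_ip_towards(remote_host, remote_port=5050):
    """Return the local IPv4 address the OS would use to reach remote_host.

    Connecting a UDP socket sends no packets; it only makes the kernel
    choose a route, and therefore a source address.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((remote_host, remote_port))
        return s.getsockname()[0]


def libprocess_env(master_host, subnet=None):
    """Return a copy of the environment with LIBPROCESS_IP set.

    If subnet is given (e.g. the host-only network), refuse an address
    outside it, since the master could not route back to it.
    """
    ip = local_ip_towards(master_host)
    if subnet is not None:
        if ipaddress.ip_address(ip) not in ipaddress.ip_network(subnet):
            raise ValueError(f"{ip} is not inside {subnet}")
    env = dict(os.environ)
    env["LIBPROCESS_IP"] = ip
    return env
```

For instance, libprocess_env("192.168.33.10", subnet="192.168.33.0/24")
returns an environment that can be passed as env= to subprocess when
launching spark-shell from the host, and it fails loudly if the host would
reach the master through some other interface.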

There's also the fact that the Spark driver (pyspark, or spark-submit) will
need to be able to talk to the worker nodes, but that should "just work"
once you get Mesos to work.

HTH,


*Marco Massenzio*
*Distributed Systems Engineer*

On Thu, May 28, 2015 at 11:13 PM, Alberto Rodriguez <ardlema@gmail.com>
wrote:

> To be honest I don't know what the problem was. I didn't manage to make my
> Spark jobs work on the mesos cluster running on two virtual machines. I
> managed to make it work when I ran my Spark jobs on my local machine, with
> both the mesos master and slaves also running on my machine.
>
> I guess something is not working properly in the way that virtualbox is
> assigning their network interfaces to the virtual machines but I can't
> waste more time in the issue.
>
> Thank you again for your help!
>
> 2015-05-28 19:28 GMT+02:00 Alex Rukletsov <alex@mesosphere.com>:
>
> > Great! Mind sharing with the list what the problem was (for future
> > reference)?
> >
> > On Thu, May 28, 2015 at 5:25 PM, Alberto Rodriguez <ardlema@gmail.com>
> > wrote:
> >
> > > Hi Alex,
> > >
> > > I managed to make it work!! Finally I'm running both mesos master and
> > > slave in my laptop and picking up the spark jar from a hdfs installed in
> > > a VM. I've just launched a spark job and it is working fine!
> > >
> > > Thank you very much for your help
> > >
> > > 2015-05-28 16:20 GMT+02:00 Alberto Rodriguez <ardlema@gmail.com>:
> > >
> > > > Hi Alex,
> > > >
> > > > see following an extract of the chronos log (not sure whether this is
> > > > the log you were talking about):
> > > >
> > > > 2015-05-28_14:18:28.49322 [2015-05-28 14:18:28,491] INFO No tasks scheduled! Declining offers (com.airbnb.scheduler.mesos.MesosJobFramework:106)
> > > > 2015-05-28_14:18:34.49896 [2015-05-28 14:18:34,497] INFO Received resource offers
> > > > 2015-05-28_14:18:34.49903  (com.airbnb.scheduler.mesos.MesosJobFramework:87)
> > > > 2015-05-28_14:18:34.50036 [2015-05-28 14:18:34,498] INFO No tasks scheduled! Declining offers (com.airbnb.scheduler.mesos.MesosJobFramework:106)
> > > > 2015-05-28_14:18:40.50442 [2015-05-28 14:18:40,503] INFO Received resource offers
> > > > 2015-05-28_14:18:40.50444  (com.airbnb.scheduler.mesos.MesosJobFramework:87)
> > > > 2015-05-28_14:18:40.50506 [2015-05-28 14:18:40,503] INFO No tasks scheduled! Declining offers (com.airbnb.scheduler.mesos.MesosJobFramework:106)
> > > >
> > > > I'm using 0.20.1 because I'm using this vagrant machine:
> > > > https://github.com/Banno/vagrant-mesos
> > > >
> > > > Kind regards and thank you again for your help
> > > >
> > > > 2015-05-28 14:09 GMT+02:00 Alex Rukletsov <alex@mesosphere.com>:
> > > >
> > > >> Alberto,
> > > >>
> > > >> it looks like Spark scheduler disconnects right after establishing the
> > > >> connection. Would you mind sharing scheduler logs as well? Also I see
> > > >> that you haven't specified the failover_timeout, try setting this value
> > > >> to something meaningful (several hours for test purposes).
> > > >>
> > > >> And by the way, any reason you're still on Mesos 0.20.1?
> > > >>
> > > >> On Wed, May 27, 2015 at 5:32 PM, Alberto Rodriguez <ardlema@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Hi Alex,
> > > >> >
> > > >> > I do not know what's going on; now I'm unable to access the spark
> > > >> > console again. It hangs at the same point as before. The master logs
> > > >> > follow:
> > > >> >
> > > >> > 2015-05-27_15:30:53.68764 I0527 15:30:53.687494   944 master.cpp:3760] Sending 1 offers to framework 20150527-100126-169978048-5050-1851-0001 (chronos-2.3.0_mesos-0.20.1-SNAPSHOT) at scheduler-be29901f-39ab-4bdf-a9ec-691032775860@192.168.33.10:32768
> > > >> > 2015-05-27_15:30:53.69032 I0527 15:30:53.690196   942 master.cpp:2273] Processing ACCEPT call for offers: [ 20150527-152023-169978048-5050-876-O241 ] on slave 20150527-152023-169978048-5050-876-S0 at slave(1)@192.168.33.11:5051 (mesos-slave1) for framework 20150527-100126-169978048-5050-1851-0001 (chronos-2.3.0_mesos-0.20.1-SNAPSHOT) at scheduler-be29901f-39ab-4bdf-a9ec-691032775860@192.168.33.10:32768
> > > >> > 2015-05-27_15:30:53.69038 I0527 15:30:53.690300   942 hierarchical.hpp:648] Recovered mem(*):1024; cpus(*):2; disk(*):33375; ports(*):[31000-32000] (total allocatable: mem(*):1024; cpus(*):2; disk(*):33375; ports(*):[31000-32000]) on slave 20150527-152023-169978048-5050-876-S0 from framework 20150527-100126-169978048-5050-1851-0001
> > > >> > 2015-05-27_15:30:54.00952 I0527 15:30:54.009363   937 master.cpp:1574] Received registration request for framework 'Spark shell' at scheduler-15df0294-c03c-4645-9079-a48128c68422@127.0.0.1:55562
> > > >> > 2015-05-27_15:30:54.00957 I0527 15:30:54.009461   937 master.cpp:1638] Registering framework 20150527-152023-169978048-5050-876-0026 (Spark shell) at scheduler-15df0294-c03c-4645-9079-a48128c68422@127.0.0.1:55562
> > > >> > 2015-05-27_15:30:54.00994 I0527 15:30:54.009703   937 hierarchical.hpp:321] Added framework 20150527-152023-169978048-5050-876-0026
> > > >> > 2015-05-27_15:30:54.00996 I0527 15:30:54.009826   937 master.cpp:3760] Sending 1 offers to framework 20150527-152023-169978048-5050-876-0026 (Spark shell) at scheduler-15df0294-c03c-4645-9079-a48128c68422@127.0.0.1:55562
> > > >> > 2015-05-27_15:30:54.01035 I0527 15:30:54.010267   944 master.cpp:878] Framework 20150527-152023-169978048-5050-876-0026 (Spark shell) at scheduler-15df0294-c03c-4645-9079-a48128c68422@127.0.0.1:55562 disconnected
> > > >> > 2015-05-27_15:30:54.01037 I0527 15:30:54.010308   944 master.cpp:1948] Disconnecting framework 20150527-152023-169978048-5050-876-0026 (Spark shell) at scheduler-15df0294-c03c-4645-9079-a48128c68422@127.0.0.1:55562
> > > >> > 2015-05-27_15:30:54.01038 I0527 15:30:54.010326   944 master.cpp:1964] Deactivating framework 20150527-152023-169978048-5050-876-0026 (Spark shell) at scheduler-15df0294-c03c-4645-9079-a48128c68422@127.0.0.1:55562
> > > >> > 2015-05-27_15:30:54.01053 I0527 15:30:54.010447   939 hierarchical.hpp:400] Deactivated framework 20150527-152023-169978048-5050-876-0026
> > > >> > 2015-05-27_15:30:54.01055 I0527 15:30:54.010459   944 master.cpp:900] Giving framework 20150527-152023-169978048-5050-876-0026 (Spark shell) at scheduler-15df0294-c03c-4645-9079-a48128c68422@127.0.0.1:55562 0ns to failover
> > > >> >
> > > >> >
> > > >> > Kind regards and thank you very much for your help!!
> > > >> >
> > > >> >
> > > >> >
> > > >> > 2015-05-27 16:28 GMT+02:00 Alex Rukletsov <alex@mesosphere.com>:
> > > >> >
> > > >> > > Alberto,
> > > >> > >
> > > >> > > would you mind providing slave and master logs (or appropriate
> > > >> > > parts of them)? Have you specified the --work_dir flag for your
> > > >> > > Mesos Workers?
> > > >> > >
> > > >> > > On Wed, May 27, 2015 at 3:56 PM, Alberto Rodriguez
> > > >> > > <ardlema@gmail.com> wrote:
> > > >> > >
> > > >> > > > Hi Alex,
> > > >> > > >
> > > >> > > > Thank you for replying. I managed to fix the first problem but
> > > >> > > > now when I launch a spark job through my console mesos is losing
> > > >> > > > all the tasks. I can see them all in my mesos slave but their
> > > >> > > > status is LOST. The stderr & stdout files of the tasks are both
> > > >> > > > empty.
> > > >> > > >
> > > >> > > > Any ideas?
> > > >> > > >
> > > >> > > > 2015-05-26 17:35 GMT+02:00 Alex Rukletsov <alex@mesosphere.com>:
> > > >> > > >
> > > >> > > > > Alberto,
> > > >> > > > >
> > > >> > > > > What may be happening in your case is that the Master is not
> > > >> > > > > able to talk to your scheduler. When responding to a scheduler,
> > > >> > > > > the Mesos Master doesn't use the IP from which the request came,
> > > >> > > > > but rather the IP set in the "Libprocess-from" field. That's
> > > >> > > > > exactly what you specify in the LIBPROCESS_IP env var prior to
> > > >> > > > > starting your scheduler. Could you please double check that it
> > > >> > > > > is set up correctly and that the IP is reachable from the Mesos
> > > >> > > > > Master?
> > > >> > > > >
> > > >> > > > > In case you are not able to solve the problem, please provide
> > > >> > > > > scheduler and Master logs together with master, zookeeper, and
> > > >> > > > > scheduler configurations.
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > On Mon, May 25, 2015 at 6:30 PM, Alberto Rodriguez
> > > >> > > > > <ardlema@gmail.com> wrote:
> > > >> > > > >
> > > >> > > > > > Hi all,
> > > >> > > > > >
> > > >> > > > > > I managed to get a mesos cluster up & running on a Ubuntu VM.
> > > >> > > > > > I've also been able to run and connect a spark-shell from this
> > > >> > > > > > machine, and it works properly.
> > > >> > > > > >
> > > >> > > > > > Unfortunately, when I try to connect from the host machine
> > > >> > > > > > where the VM is running to launch spark jobs, I cannot.
> > > >> > > > > > See below the spark console output:
> > > >> > > > > >
> > > >> > > > > > Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_75)
> > > >> > > > > > Type in expressions to have them evaluated.
> > > >> > > > > > Type :help for more information.
> > > >> > > > > > 15/05/25 18:13:00 INFO SecurityManager: Changing view acls to: arodriguez
> > > >> > > > > > 15/05/25 18:13:00 INFO SecurityManager: Changing modify acls to: arodriguez
> > > >> > > > > > 15/05/25 18:13:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(arodriguez); users with modify permissions: Set(arodriguez)
> > > >> > > > > > 15/05/25 18:13:01 INFO Slf4jLogger: Slf4jLogger started
> > > >> > > > > > 15/05/25 18:13:01 INFO Remoting: Starting remoting
> > > >> > > > > > 15/05/25 18:13:01 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@localhost.localdomain:47229]
> > > >> > > > > > 15/05/25 18:13:01 INFO Utils: Successfully started service 'sparkDriver' on port 47229.
> > > >> > > > > > 15/05/25 18:13:01 INFO SparkEnv: Registering MapOutputTracker
> > > >> > > > > > 15/05/25 18:13:01 INFO SparkEnv: Registering BlockManagerMaster
> > > >> > > > > > 15/05/25 18:13:01 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20150525181301-7fa8
> > > >> > > > > > 15/05/25 18:13:01 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
> > > >> > > > > > 15/05/25 18:13:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > > >> > > > > > 15/05/25 18:13:01 INFO HttpFileServer: HTTP File server directory is /tmp/spark-1249c23f-adc8-4fcd-a044-b65a80f40e16
> > > >> > > > > > 15/05/25 18:13:01 INFO HttpServer: Starting HTTP Server
> > > >> > > > > > 15/05/25 18:13:01 INFO Utils: Successfully started service 'HTTP file server' on port 51659.
> > > >> > > > > > 15/05/25 18:13:01 INFO Utils: Successfully started service 'SparkUI' on port 4040.
> > > >> > > > > > 15/05/25 18:13:01 INFO SparkUI: Started SparkUI at http://localhost.localdomain:4040
> > > >> > > > > > WARNING: Logging before InitGoogleLogging() is written to STDERR
> > > >> > > > > > W0525 18:13:01.749449 10908 sched.cpp:1323]
> > > >> > > > > > **************************************************
> > > >> > > > > > Scheduler driver bound to loopback interface! Cannot communicate with remote master(s). You might want to set 'LIBPROCESS_IP' environment variable to use a routable IP address.
> > > >> > > > > > **************************************************
> > > >> > > > > > 2015-05-25 18:13:01,749:10746(0x7fd4b1ffb700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.6
> > > >> > > > > > 2015-05-25 18:13:01,749:10746(0x7fd4b1ffb700):ZOO_INFO@log_env@716: Client environment:host.name=localhost.localdomain
> > > >> > > > > > 2015-05-25 18:13:01,749:10746(0x7fd4b1ffb700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
> > > >> > > > > > 2015-05-25 18:13:01,749:10746(0x7fd4b1ffb700):ZOO_INFO@log_env@724: Client environment:os.arch=3.19.7-200.fc21.x86_64
> > > >> > > > > > 2015-05-25 18:13:01,749:10746(0x7fd4b1ffb700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Thu May 7 22:00:21 UTC 2015
> > > >> > > > > > 2015-05-25 18:13:01,749:10746(0x7fd4b1ffb700):ZOO_INFO@log_env@733: Client environment:user.name=arodriguez
> > > >> > > > > > I0525 18:13:01.749791 10908 sched.cpp:157] Version: 0.22.1
> > > >> > > > > > 2015-05-25 18:13:01,749:10746(0x7fd4b1ffb700):ZOO_INFO@log_env@741: Client environment:user.home=/home/arodriguez
> > > >> > > > > > 2015-05-25 18:13:01,749:10746(0x7fd4b1ffb700):ZOO_INFO@log_env@753: Client environment:user.dir=/home/arodriguez/dev/spark-1.2.0-bin-hadoop2.4/bin
> > > >> > > > > > 2015-05-25 18:13:01,749:10746(0x7fd4b1ffb700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=10.141.141.10:2181 sessionTimeout=10000 watcher=0x7fd4c2f0d5b0 sessionId=0 sessionPasswd=<null> context=0x7fd3d40063c0 flags=0
> > > >> > > > > > 2015-05-25 18:13:01,750:10746(0x7fd4ab7fe700):ZOO_INFO@check_events@1705: initiated connection to server [10.141.141.10:2181]
> > > >> > > > > > 2015-05-25 18:13:01,752:10746(0x7fd4ab7fe700):ZOO_INFO@check_events@1752: session establishment complete on server [10.141.141.10:2181], sessionId=0x14d8babef360022, negotiated timeout=10000
> > > >> > > > > > I0525 18:13:01.752760 10913 group.cpp:313] Group process (group(1)@127.0.0.1:48557) connected to ZooKeeper
> > > >> > > > > > I0525 18:13:01.752787 10913 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> > > >> > > > > > I0525 18:13:01.752807 10913 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
> > > >> > > > > > I0525 18:13:01.754317 10909 detector.cpp:138] Detected a new leader: (id='16')
> > > >> > > > > > I0525 18:13:01.754408 10913 group.cpp:659] Trying to get '/mesos/info_0000000016' in ZooKeeper
> > > >> > > > > > I0525 18:13:01.755056 10913 detector.cpp:452] A new leading master (UPID=master@127.0.1.1:5050) is detected
> > > >> > > > > > I0525 18:13:01.755113 10911 sched.cpp:254] New master detected at master@127.0.1.1:5050
> > > >> > > > > > I0525 18:13:01.755345 10911 sched.cpp:264] No credentials provided. Attempting to register without authentication
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > It hangs up in the last line.
> > > >> > > > > >
> > > >> > > > > > I've tried to set the LIBPROCESS_IP env variable
with no
> > luck.
> > > >> > > > > >
> > > >> > > > > > Any advice?
> > > >> > > > > >
> > > >> > > > > > Thank you in advance.
> > > >> > > > > >
> > > >> > > > > > Kind regards,
> > > >> > > > > >
> > > >> > > > > > Alberto
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>
