singa-dev mailing list archives

From Anh Dinh <dinh...@comp.nus.edu.sg>
Subject Re: Error while running singa on mesos
Date Wed, 22 Jun 2016 07:50:43 GMT
Let's say you create a container node0 on machine A and a container node1
on machine B.

In node1, can you ping node0?

If you cannot, then Weave wasn't running properly (with Docker v1.8.3).
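
A minimal sketch of that check, assuming a POSIX shell inside node1. The helper name lookup_in_hosts is hypothetical (it is not part of SINGA, Docker or Weave); it only reports whether a hosts file has an entry for a given name, which is worth confirming before trying the ping.

```shell
#!/bin/sh
# Hypothetical helper: print the IP address that a hosts file maps to a name.
# Usage: lookup_in_hosts NAME [HOSTS_FILE]   (HOSTS_FILE defaults to /etc/hosts)
# Returns non-zero if the name has no entry in the file.
lookup_in_hosts() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v n="$name" '
        /^[[:space:]]*#/ { next }            # skip whole-line comments
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break          # stop at a trailing comment
                if ($i == n) { print $1; found = 1; exit }
            }
        }
        END { exit !found }                   # status 0 only if a match was printed
    ' "$file"
}

# Example, run inside node1:
#   lookup_in_hosts node0 && ping -c 3 node0
```

If the lookup fails, the name is not in the container's hosts file at all; if it succeeds but the ping fails, the problem is more likely in the overlay network itself.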

Anh.


On 22 June 2016 at 15:42, Venkat Katta <skatta@adobe.com> wrote:

> As the Docker containers are on different machines, they can no longer
> communicate with each other, because their IPs are internal to each
> machine. So I am using Weave, as described in the documentation:
> https://singa.incubator.apache.org/docs/docker.html#launch_distributed .
> It is trying to bind zsock on localhost, not on node1 or node2.
>
> Regards,
> Venkat Satish Katta
> ------------------------------
> *From:* Anh Dinh <dinhtta@comp.nus.edu.sg>
> *Sent:* Wednesday, June 22, 2016 12:23:26 PM
> *To:* Venkat Katta
> *Cc:* Wang Wei; dev@singa.incubator.apache.org
>
> *Subject:* Re: Error while running singa on mesos
>
> We had problems with Docker version >= 1.9 (yours is even newer), as noted
> in https://singa.incubator.apache.org/docs/docker.html#launch_pseudo
>
> Basically new versions of Docker changed the DNS resolution mechanism: the
> Docker daemon no longer updates the /etc/hosts file of existing containers
> when a new one is launched.
>
> One suggestion is to downgrade Docker to 1.8:
>
> sudo apt-get install docker-engine=1.8.3-0~trusty
>
> Another option is to enter the IP addresses manually into the /etc/hosts
> files. But we have not tried this with Weave, so there is a high chance
> that it won't work with Weave.
>
>
> On 22 June 2016 at 14:39, Venkat Katta <skatta@adobe.com> wrote:
>
>> docker version : 1.11.2
>>
>> regards,
>> venkat satish katta
>> ------------------------------
>> *From:* Anh Dinh <dinhtta@comp.nus.edu.sg>
>> *Sent:* Wednesday, June 22, 2016 12:04:56 PM
>> *To:* Wang Wei; Venkat Katta
>>
>> *Cc:* dev@singa.incubator.apache.org
>> *Subject:* Re: Error while running singa on mesos
>>
>> what version of Docker are you running?
>>
>> Anh.
>>
>>
>> On 22 June 2016 at 14:26, Wang Wei <wangwei@apache.org> wrote:
>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Venkat Katta <skatta@adobe.com>
>>> Date: Wed, Jun 22, 2016 at 1:31 PM
>>> Subject: Re: Error while running singa on mesos
>>> To: Wang Wei <wangwei@apache.org>
>>>
>>>
>>> It works fine if I replace node0 and node2 with their IP addresses. I
>>> am using Weave for transparent communication between the containers. In
>>> singa.conf, I used node0 rather than the IP address of node0 to connect
>>> to ZooKeeper, and it is able to connect, so why can't SINGA resolve the
>>> hostname? Also, while running SINGA with Mesos, it uses localhost rather
>>> than the IP addresses of node1 and node2, and we are not passing any
>>> argument regarding the IP addresses of the slaves when running SINGA.
>>>
>>>
>>> F0622 05:18:28.932391  1513 socket.cc:98] Check failed: port != -1 (-1
>>> vs. -1) tcp://localhost:*
>>>
>>>
>>> Thanks,
>>>
>>> Venkat satish katta
>>> ------------------------------
>>> *From:* Wang Wei <wangwei@apache.org>
>>> *Sent:* Wednesday, June 22, 2016 8:46:36 AM
>>> *To:* Venkat Katta
>>>
>>> *Subject:* Re: Error while running singa on mesos
>>>
>>> If you are using Docker (without Mesos), it could be a network routing
>>> problem. You may need to configure Docker to set up the network so that
>>> node0 and node2 can be accessed from node1.
>>> We are trying your configuration.
>>>
>>> regards,
>>> wang wei
>>>
>>>
>>> On Wed, Jun 22, 2016 at 10:32 AM, Wang Wei <wangwei@apache.org> wrote:
>>>
>>>> Hi Venkat,
>>>>
>>>> It is probably a problem with the node addresses.
>>>> Please replace node0 and node2 with their IP addresses.
>>>>
>>>> regards,
>>>> wei
>>>>
>>>> On Wed, Jun 22, 2016 at 2:40 AM, Venkat Katta <skatta@adobe.com> wrote:
>>>>
>>>>> I tried running without Mesos and got the same error.
>>>>>
>>>>>
>>>>> root@node0:~/incubator-singa# ./bin/singa-run.sh -conf
>>>>> examples/cifar10/hybrid.conf
>>>>> Unique JOB_ID is 4
>>>>> Record job information to /tmp/singa-log/job-info/job-4-20160621-183305
>>>>> Executing @ node2 : cd /root/incubator-singa; source
>>>>> /root/incubator-singa/conf/profile; ./singa -singa_conf
>>>>> /root/incubator-singa/conf/singa.conf -singa_job 4 -conf
>>>>> /root/incubator-singa/examples/cifar10/hybrid.conf
>>>>> Executing @ node0 : cd /root/incubator-singa; source
>>>>> /root/incubator-singa/conf/profile; ./singa -singa_conf
>>>>> /root/incubator-singa/conf/singa.conf -singa_job 4 -conf
>>>>> /root/incubator-singa/examples/cifar10/hybrid.conf
>>>>> F0621 18:33:24.171468   725 socket.cc:98] Check failed: port != -1 (-1
>>>>> vs. -1) tcp://node2:*
>>>>> *** Check failure stack trace: ***
>>>>>     @     0x7f10d0a6b9fd  google::LogMessage::Fail()
>>>>>     @     0x7f10d0a6d89d  google::LogMessage::SendToLog()
>>>>>     @     0x7f10d0a6b5ec  google::LogMessage::Flush()
>>>>>     @     0x7f10d0a6e1be  google::LogMessageFatal::~LogMessageFatal()
>>>>>     @     0x7f10d0e05d79  singa::Router::Bind()
>>>>>     @     0x7f10d0d7a8bc  singa::Driver::Train()
>>>>>     @     0x7f10d0d7f48b  singa::Driver::Train()
>>>>>     @           0x40c915  main
>>>>>     @     0x7f10c5f13f45  (unknown)
>>>>>     @           0x40cb7e  (unknown)
>>>>> F0621 18:33:06.244278  1042 socket.cc:98] Check failed: port != -1 (-1
>>>>> vs. -1) tcp://node0:*
>>>>> *** Check failure stack trace: ***
>>>>>     @     0x7f6d4516d9fd  google::LogMessage::Fail()
>>>>>     @     0x7f6d4516f89d  google::LogMessage::SendToLog()
>>>>>     @     0x7f6d4516d5ec  google::LogMessage::Flush()
>>>>>     @     0x7f6d451701be  google::LogMessageFatal::~LogMessageFatal()
>>>>>     @     0x7f6d45507d79  singa::Router::Bind()
>>>>>     @     0x7f6d4547c8bc  singa::Driver::Train()
>>>>>     @     0x7f6d4548148b  singa::Driver::Train()
>>>>>     @           0x40c915  main
>>>>>     @     0x7f6d3a615f45  (unknown)
>>>>>     @           0x40cb7e  (unknown)
>>>>> bash: line 1:   725 Aborted                 (core dumped) ./singa
>>>>> -singa_conf /root/incubator-singa/conf/singa.conf -singa_job 4 -conf
>>>>> /root/incubator-singa/examples/cifar10/hybrid.conf -host node2
>>>>> bash: line 1:  1042 Aborted                 (core dumped) ./singa
>>>>> -singa_conf /root/incubator-singa/conf/singa.conf -singa_job 4 -conf
>>>>> /root/incubator-singa/examples/cifar10/hybrid.conf -host node0
>>>>> E0621 18:33:07.467438  1067 job_manager.cc:156] job 4 not exists
>>>>>
>>>>>
>>>>> ------------------------------
>>>>> *From:* Wang Wei <wangwei@apache.org>
>>>>> *Sent:* Tuesday, June 21, 2016 7:09:46 PM
>>>>> *To:* Venkat Katta
>>>>> *Cc:* dev@singa.incubator.apache.org
>>>>> *Subject:* Re: Error while running singa on mesos
>>>>>
>>>>> Hi,
>>>>>
>>>>> Can you try to run it without Mesos?
>>>>> 1. Compile singa with enable-dist
>>>>> 2. change conf/singa.conf to set the zookeeper host
>>>>> 3. update the conf/hostfile one line per machine
>>>>> 4. update the conf/profile to export LD_LIBRARY_PATH
>>>>>
>>>>> regards,
>>>>> Wei
>>>>>
>>>>> On Tue, Jun 21, 2016 at 8:52 PM, Venkat Katta <skatta@adobe.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> I am actually trying to run singa on mesos in fully distributed
>>>>>> architecture. I built the docker images as given in the
>>>>>> documentation. I am using mesos 0.28.2 and singa 0.3-rc3. I am
>>>>>> running each docker container using --net=host flag so that they
>>>>>> take the ip of the system. Singa works as long as the workers are
>>>>>> all in one machine.
>>>>>> When I try to use two machines for training it shows error
>>>>>>
>>>>>>
>>>>>> F0617 10:00:43.862246 2742 socket.cc:98] Check failed: port != -1
>>>>>> (-1 vs. -1) tcp://localhost:*
>>>>>>
>>>>>>
>>>>>> So while running the scheduler, do we need to give it a hostfile
>>>>>> containing all the hosts? How does it know the remaining hosts in
>>>>>> the cluster?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>
>>>>>> Venkat Satish Katta.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
