mesos-dev mailing list archives

From Benjamin Mahler <benjamin.mah...@gmail.com>
Subject Re: Hello, I get stuck with trying mpi on mesos, Would u plz help me out?
Date Fri, 19 Jul 2013 01:42:09 GMT
FYI, we no longer use svn; the git repository is here:
https://git-wip-us.apache.org/repos/asf/mesos.git

That said, I would recommend running on a stable release (like 0.12.0).

As for the problem you're facing, you should look in the executor sandbox
to see what the stderr logs say:
/tmp/mesos/slaves/201307181807-1402585610-5050-6498-0/frameworks/
201307181807-1402585610-5050-6498-0000/executors/0/runs/
19cd7e79-10e5-42f2-8ab5-89c99366c19a

Alternatively, did the webui show the task? If so, you can browse the
sandbox directly in the webui.
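
For reference, a minimal sketch of reading the executor's stderr from the sandbox in Python. The directory layout is taken from the path in the slave log; the helper names `sandbox_dir` and `read_sandbox_log` are made up for illustration, and the sketch assumes the slave's work directory is /tmp/mesos and that the sandbox contains a file named `stderr`.

```python
from pathlib import Path


def sandbox_dir(work_dir, slave_id, framework_id, executor_id, run_id):
    """Build the executor sandbox path using the layout seen in the slave log."""
    return (Path(work_dir) / "slaves" / slave_id
            / "frameworks" / framework_id
            / "executors" / executor_id
            / "runs" / run_id)


def read_sandbox_log(sandbox, name="stderr"):
    """Return the text of a log file in the sandbox, or None if it is missing."""
    log_file = Path(sandbox) / name
    if not log_file.is_file():
        return None
    return log_file.read_text(errors="replace")


if __name__ == "__main__":
    # IDs copied from the log in the quoted message; adjust for your own run.
    sandbox = sandbox_dir(
        "/tmp/mesos",
        "201307181807-1402585610-5050-6498-0",
        "201307181807-1402585610-5050-6498-0000",
        "0",
        "19cd7e79-10e5-42f2-8ab5-89c99366c19a",
    )
    text = read_sandbox_log(sandbox)
    print(text if text is not None else "no stderr file found in %s" % sandbox)
```

Passing `name="stdout"` reads the other log file, assuming the sandbox also holds one.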



On Thu, Jul 18, 2013 at 3:15 AM, Ye Tao <ytao@ust.hk> wrote:

> Hi, here are the details:
>
> I use the latest mesos-trunk (actually "svn co" several hours ago).
>
> I installed mpich2-1.2 following the README file in the mpi folder.
>
> I use a helloworld.c MPI example. You can find it in the attachment.
>
> I tried mpiexec and it works well:
> [image: inline image 1]
> Then I tried running it on Mesos.
> I set this machine up as both master and slave.
>
> Then I started the cluster
> and tried the command from the README:
> ./mpiexec-mesos 10.194.153.83:5050 ./helloworld
>
> It breaks down at the slave node immediately:
> [image: inline image 3]
>
>
> I checked the log file:
> I0718 18:08:27.232508  6672 slave.cpp:930] Queuing task '0' for executor 0
> of framework '201307181807-1402585610-5050-6498-0000
> I0718 18:08:27.232519  6679 process_isolator.cpp:120] Launching 0
> (/home/mesos/mesos/libexec/mesos/mesos-executor) in
> /tmp/mesos/slaves/201307181807-1402585610-5050-6498-0/frameworks/201307181807-1402585610-5050-6498-0000/executors/0/runs/19cd7e79-10e5-42f2-8ab5-89c99366c19a
> with resources ' for framework 201307181807-1402585610-5050-6498-0000
> I0718 18:08:27.233882  6684 slave.cpp:514] Successfully attached file
> '/tmp/mesos/slaves/201307181807-1402585610-5050-6498-0/frameworks/201307181807-1402585610-5050-6498-0000/executors/0/runs/19cd7e79-10e5-42f2-8ab5-89c99366c19a'
> I0718 18:08:27.234824  6679 process_isolator.cpp:183] Forked executor at
> 6776
> I0718 18:08:27.278260  6675 slave.cpp:1382] Got registration for executor
> '0' of framework 201307181807-1402585610-5050-6498-0000
> I0718 18:08:27.278591  6675 slave.cpp:1497] Flushing queued task 0 for
> executor '0' of framework 201307181807-1402585610-5050-6498-0000
> I0718 18:08:27.281960  6687 status_update_manager.cpp:289] Received status
> update TASK_RUNNING (UUID: 2ab26368-5521-497d-bd0f-578971f04931) for task 0
> of framework 201307181807-1402585610-5050-6498-0000 with checkpoint=false
> I0718 18:08:27.282130  6687 status_update_manager.cpp:449] Creating
> StatusUpdate stream for task 0 of framework
> 201307181807-1402585610-5050-6498-0000
> I0718 18:08:27.282393  6687 status_update_manager.cpp:335] Forwarding
> status update TASK_RUNNING (UUID: 2ab26368-5521-497d-bd0f-578971f04931) for
> task 0 of framework 201307181807-1402585610-5050-6498-0000 to
> master@10.194.153.83:5050
> I0718 18:08:27.282613  6694 slave.cpp:1789] Sending acknowledgement for
> status update TASK_RUNNING (UUID: 2ab26368-5521-497d-bd0f-578971f04931) for
> task 0 of framework 201307181807-1402585610-5050-6498-0000 to executor(1)@
> 10.194.153.83:36216
> I0718 18:08:27.282976  6690 status_update_manager.cpp:289] Received status
> update TASK_FAILED (UUID: 88eaee35-80d8-496a-ac81-e3099d78678d) for task 0
> of framework 201307181807-1402585610-5050-6498-0000 with checkpoint=false
> I0718 18:08:27.283299  6690 slave.cpp:1789] Sending acknowledgement for
> status update TASK_FAILED (UUID: 88eaee35-80d8-496a-ac81-e3099d78678d) for
> task 0 of framework 201307181807-1402585610-5050-6498-0000 to executor(1)@
> 10.194.153.83:36216
> I0718 18:08:27.283870  6683 status_update_manager.cpp:359] Received status
> update acknowledgement 2ab26368-5521-497d-bd0f-578971f04931 for task 0 of
> framework 201307181807-1402585610-5050-6498-0000
> I0718 18:08:27.283989  6683 status_update_manager.cpp:335] Forwarding
> status update TASK_FAILED (UUID: 88eaee35-80d8-496a-ac81-e3099d78678d) for
> task 0 of framework 201307181807-1402585610-5050-6498-0000 to
> master@10.194.153.83:5050
> I0718 18:08:27.349282  6690 slave.cpp:1101] Asked to shut down framework
> 201307181807-1402585610-5050-6498-0000 by master@10.194.153.83:5050
>
>
>
> It just suddenly failed...
> I shall attach the logs of the slave and master.
> I followed the Python code.
> The slave just tells the master that the task failed, and the master
> directly shuts down the framework.
>
> I cannot figure out why.
>
> I really appreciate your help!
>
> Best!
>
>
>
>
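
To see the order of task states in a quoted slave log like the one above, something along these lines might help. It is only a sketch: the function names are hypothetical, and the regular expression is based solely on the "Received status update" lines shown in this message, so other Mesos versions may format them differently.

```python
import re

# Matches the status_update_manager lines from the quoted log, e.g.
# "Received status update TASK_RUNNING (UUID: ...) for task 0".
STATUS_RE = re.compile(
    r"Received status\s+update (TASK_[A-Z]+) \(UUID: ([0-9a-f-]+)\) for task (\S+)"
)


def unquote(text):
    """Strip leading '>' quote markers and rejoin lines wrapped by the mail client."""
    lines = [re.sub(r"^>\s?", "", line) for line in text.splitlines()]
    return " ".join(lines)


def status_updates(text):
    """Return (state, task_id) pairs in the order they appear in the log."""
    return [(m.group(1), m.group(3)) for m in STATUS_RE.finditer(unquote(text))]
```

Run on the log in this message, the pairs would show whether the task reported TASK_RUNNING before TASK_FAILED. That only narrows down when the failure happened; the reason should still be in the sandbox stderr.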
