Subject: Re: Running a task in Mesos cluster
From: Guangya Liu
To: user@mesos.apache.org
Date: Wed, 7 Oct 2015 18:12:05 +0800

Hi Pradeep,

Can you please append more logs from your master node? I just want to see what is wrong with your master and why the framework starts to fail over.

Thanks,

Guangya
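For reference, one way to capture those master logs is to point the master at a log directory and raise the glog verbosity. This is only a sketch, assuming the master is started from a build tree as elsewhere in this thread; the log directory path is an example, not something Pradeep reported using.

    # glog files such as mesos-master.INFO end up under --log_dir
    GLOG_v=1 ./bin/mesos-master.sh --ip=192.168.0.102 --work_dir=/var/lib/mesos --log_dir=/var/log/mesos
    # then attach the tail of the INFO log around the time of the failover
    tail -n 200 /var/log/mesos/mesos-master.INFO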
On Wed, Oct 7, 2015 at 5:27 PM, Pradeep Kiruvale <pradeepkiruvale@gmail.com> wrote:
> Hi Guangya,
>
> I am running a framework from some other physical node, which is part of the same network. I am still getting the messages below and the framework is not getting registered.
>
> Any idea what the reason is?
>
> I1007 11:24:58.781914 32392 master.cpp:4815] Framework failover timeout, removing framework 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon Framework (C++)) at scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
> I1007 11:24:58.781968 32392 master.cpp:5571] Removing framework 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon Framework (C++)) at scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
> I1007 11:24:58.782352 32392 hierarchical.hpp:552] Removed framework 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019
> E1007 11:24:58.782577 32399 process.cpp:1912] Failed to shutdown socket with fd 13: Transport endpoint is not connected
> I1007 11:24:59.699587 32396 master.cpp:2179] Received SUBSCRIBE call for framework 'Balloon Framework (C++)' at scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
> I1007 11:24:59.699717 32396 master.cpp:2250] Subscribing framework Balloon Framework (C++) with checkpointing disabled and capabilities [ ]
> I1007 11:24:59.700251 32393 hierarchical.hpp:515] Added framework 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0020
> E1007 11:24:59.700253 32399 process.cpp:1912] Failed to shutdown socket with fd 13: Transport endpoint is not connected
>
> Regards,
> Pradeep
>
> On 5 October 2015 at 13:51, Guangya Liu <gyliu513@gmail.com> wrote:
>> Hi Pradeep,
>>
>> I think the problem might be caused by running the lxc container on the master node; I am not sure whether there is a port conflict or something else wrong.
>>
>> In my case I was running the client on a new node, not on the master node. Perhaps you can try putting your client on a separate node rather than on the master.
>>
>> Thanks,
>>
>> Guangya
>>
>> On Mon, Oct 5, 2015 at 7:30 PM, Pradeep Kiruvale <pradeepkiruvale@gmail.com> wrote:
>>> Hi Guangya,
>>>
>>> Hmm!... That is strange in my case!
>>>
>>> If I run mesos-execute on one of the slave/master nodes, the tasks get their resources and are scheduled well. But if I start mesos-execute on another node which is neither a slave nor the master, I have this issue.
>>>
>>> I am using an lxc container on the master as a client to launch the tasks. This is also in the same network as the master/slaves, and I launch the task just as you did. But the tasks are not getting scheduled.
>>>
>>> On the master the logs are the same as I sent you before:
>>>
>>> Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>>
>>> On both of the slaves I can see the logs below:
>>>
>>> I1005 13:23:32.547987  4831 slave.cpp:1980] Asked to shut down framework 77539063-89ce-4efa-a20b-ca788abbd912-0060 by master@192.168.0.102:5050
>>> W1005 13:23:32.548135  4831 slave.cpp:1995] Cannot shut down unknown framework 77539063-89ce-4efa-a20b-ca788abbd912-0060
>>> I1005 13:23:33.697707  4833 slave.cpp:3926] Current disk usage 3.60%. Max allowed age: 6.047984349521910days
>>> I1005 13:23:34.098599  4829 slave.cpp:1980] Asked to shut down framework 77539063-89ce-4efa-a20b-ca788abbd912-0061 by master@192.168.0.102:5050
>>> W1005 13:23:34.098740  4829 slave.cpp:1995] Cannot shut down unknown framework 77539063-89ce-4efa-a20b-ca788abbd912-0061
>>> I1005 13:23:35.274569  4831 slave.cpp:1980] Asked to shut down framework 77539063-89ce-4efa-a20b-ca788abbd912-0062 by master@192.168.0.102:5050
>>> W1005 13:23:35.274683  4831 slave.cpp:1995] Cannot shut down unknown framework 77539063-89ce-4efa-a20b-ca788abbd912-0062
>>> I1005 13:23:36.193964  4829 slave.cpp:1980] Asked to shut down framework 77539063-89ce-4efa-a20b-ca788abbd912-0063 by master@192.168.0.102:5050
>>> W1005 13:23:36.194090  4829 slave.cpp:1995] Cannot shut down unknown framework 77539063-89ce-4efa-a20b-ca788abbd912-0063
>>> I1005 13:24:01.914788  4827 slave.cpp:1980] Asked to shut down framework 77539063-89ce-4efa-a20b-ca788abbd912-0064 by master@192.168.0.102:5050
>>> W1005 13:24:01.914937  4827 slave.cpp:1995] Cannot shut down unknown framework 77539063-89ce-4efa-a20b-ca788abbd912-0064
>>> I1005 13:24:03.469974  4833 slave.cpp:1980] Asked to shut down framework 77539063-89ce-4efa-a20b-ca788abbd912-0065 by master@192.168.0.102:5050
>>> W1005 13:24:03.470118  4833 slave.cpp:1995] Cannot shut down unknown framework 77539063-89ce-4efa-a20b-ca788abbd912-0065
>>> I1005 13:24:04.642654  4826 slave.cpp:1980] Asked to shut down framework 77539063-89ce-4efa-a20b-ca788abbd912-0066 by master@192.168.0.102:5050
>>> W1005 13:24:04.642812  4826 slave.cpp:1995] Cannot shut down unknown framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>>
>>> On 5 October 2015 at 13:09, Guangya Liu <gyliu513@gmail.com> wrote:
>>>> Hi Pradeep,
>>>>
>>>> From your log, it seems that the master process is exiting, and this caused the framework to fail over to another Mesos master. Can you please show more detail on the steps to reproduce your issue?
>>>>
>>>> I did a test by running mesos-execute on a client host which does not have any Mesos service, and the task was scheduled fine.
>>>>
>>>> root@mesos008:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 10" --resources="cpus(*):1;mem(*):256"
>>>> I1005 18:59:47.974123  1233 sched.cpp:164] Version: 0.26.0
>>>> I1005 18:59:47.990890  1248 sched.cpp:262] New master detected at master@192.168.0.107:5050
>>>> I1005 18:59:47.993074  1248 sched.cpp:272] No credentials provided. Attempting to register without authentication
>>>> I1005 18:59:48.001194  1249 sched.cpp:641] Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>>> Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>>> task cluster-test submitted to slave c0e5fdde-595e-4768-9d04-25901d4523b6-S0
>>>> Received status update TASK_RUNNING for task cluster-test
>>>> Received status update TASK_FINISHED for task cluster-test
>>>> I1005 18:59:58.431144  1249 sched.cpp:1771] Asked to stop the driver
>>>> I1005 18:59:58.431591  1249 sched.cpp:1040] Stopping framework '04b9af5e-e9b6-4c59-8734-eba407163922-0002'
>>>> root@mesos008:~/src/mesos/m1/mesos/build# ps -ef | grep mesos
>>>> root      1259  1159  0 19:06 pts/0    00:00:00 grep --color=auto mesos
>>>>
>>>> Thanks,
>>>>
>>>> Guangya
>>>>
>>>> On Mon, Oct 5, 2015 at 6:50 PM, Pradeep Kiruvale <pradeepkiruvale@gmail.com> wrote:
>>>>> Hi Guangya,
>>>>>
>>>>> I am facing one more issue. If I try to schedule tasks from an external client system running the same mesos-execute CLI, the tasks are not getting launched. The tasks reach the master and it just drops the requests; below are the related logs:
>>>>>
>>>>> I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework with checkpointing disabled and capabilities [ ]
>>>>> E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown socket with fd 14: Transport endpoint is not connected
>>>>> I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>> I1005 11:33:35.026298 21369 master.cpp:1119] Framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 disconnected
>>>>> I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>> I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>> E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown socket with fd 14: Transport endpoint is not connected
>>>>> I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 0ns to failover
>>>>> I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>> W1005 11:33:35.026757 21368 master.cpp:4828] Master returning resources offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 because the framework has terminated or is inactive
>>>>> I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total: cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000], allocated: ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>> I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total: cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000], allocated: ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>> I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover timeout, removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>> I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>
>>>>> Can you please tell me what the reason is? The client is in the same network as well, but it does not run any master or slave processes.
>>>>>
>>>>> Thanks & Regards,
>>>>> Pradeep
>>>>>
>>>>> On 5 October 2015 at 12:13, Guangya Liu <gyliu513@gmail.com> wrote:
>>>>>> Hi Pradeep,
>>>>>>
>>>>>> Glad it finally works! Not sure if you are using systemd.slice or not; are you running into this issue: https://issues.apache.org/jira/browse/MESOS-1195
>>>>>>
>>>>>> Hope Jie Yu can give you some help on this ;-)
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Guangya
>>>>>>
>>>>>> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <pradeepkiruvale@gmail.com> wrote:
>>>>>>> Hi Guangya,
>>>>>>>
>>>>>>> Thanks for sharing the information.
>>>>>>>
>>>>>>> Now I could launch the tasks. The problem was with permissions: if I start all the slaves and the master as root, it works fine; otherwise I have problems launching the tasks.
>>>>>>>
>>>>>>> But on one of the slaves I could not launch the slave as root; I am facing the following issue:
>>>>>>>
>>>>>>> Failed to create a containerizer: Could not create MesosContainerizer: Failed to create launcher: Failed to create Linux launcher: Failed to mount cgroups hierarchy at '/sys/fs/cgroup/freezer': 'freezer' is already attached to another hierarchy
>>>>>>>
>>>>>>> I took that node out of the cluster for now. The tasks are getting scheduled on the other two slave nodes.
>>>>>>>
>>>>>>> Thanks for your timely help.
>>>>>>>
>>>>>>> -Pradeep
>>>>>>>
>>>>>>> On 5 October 2015 at 10:54, Guangya Liu <gyliu513@gmail.com> wrote:
>>>>>>>> Hi Pradeep,
>>>>>>>>
>>>>>>>> My steps were pretty simple, just as in
>>>>>>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>>>>>>
>>>>>>>> On the master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1 ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>>>>>>> On the 3 slave nodes: root@mesos007:~/src/mesos/m1/mesos/build# GLOG_v=1 ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>>>>>>
>>>>>>>> Then schedule a task from any of the nodes; here I was using slave node mesos007. You can see that the two tasks were launched on different hosts.
>>>>>>>>
>>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>>> I1005 16:49:11.013432  2971 sched.cpp:164] Version: 0.26.0
>>>>>>>> I1005 16:49:11.027802  2992 sched.cpp:262] New master detected at master@192.168.0.107:5050
>>>>>>>> I1005 16:49:11.029579  2992 sched.cpp:272] No credentials provided. Attempting to register without authentication
>>>>>>>> I1005 16:49:11.038182  2985 sched.cpp:641] Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>>> task cluster-test submitted to slave c0e5fdde-595e-4768-9d04-25901d4523b6-S0  <<<<<<<<<<<<<<<<<<
>>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>> ^C
>>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>>> I1005 16:50:18.346984  3036 sched.cpp:164] Version: 0.26.0
>>>>>>>> I1005 16:50:18.366114  3055 sched.cpp:262] New master detected at master@192.168.0.107:5050
>>>>>>>> I1005 16:50:18.368010  3055 sched.cpp:272] No credentials provided. Attempting to register without authentication
>>>>>>>> I1005 16:50:18.376338  3056 sched.cpp:641] Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>>> task cluster-test submitted to slave c0e5fdde-595e-4768-9d04-25901d4523b6-S1  <<<<<<<<<<<<<<<<<<<<
>>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Guangya
>>>>>>>>
>>>>>>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <pradeepkiruvale@gmail.com> wrote:
>>>>>>>>> Hi Guangya,
>>>>>>>>>
>>>>>>>>> Thanks for your reply.
>>>>>>>>>
>>>>>>>>> I just want to know how you launched the tasks.
>>>>>>>>>
>>>>>>>>> 1. What processes have you started on the master?
>>>>>>>>> 2. What processes have you started on the slaves?
>>>>>>>>>
>>>>>>>>> I am missing something here; all my slaves have enough memory and cpus to launch the tasks I mentioned, so what I am missing is probably some configuration step.
>>>>>>>>>
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Pradeep
>>>>>>>>>
>>>>>>>>> On 3 October 2015 at 13:14, Guangya Liu <gyliu513@gmail.com> wrote:
>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>
>>>>>>>>>> I did some tests with your case and found that the task can run randomly on any of the three slave hosts; every run may have a different result. The logic is here:
>>>>>>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>>>>>>> The allocator randomly shuffles the slaves every time it allocates resources for offers.
>>>>>>>>>>
>>>>>>>>>> I see that each of your tasks needs the minimum resources "cpus(*):3;mem(*):2560"; can you check whether all of your slaves have enough resources? If you want your tasks to run on other slaves, then those slaves need to have at least 3 cpus and 2560M memory.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>>> Hi Ondrej,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your reply.
>>>>>>>>>>>
>>>>>>>>>>> I did solve that issue; yes, you are right, there was an issue with the slave IP address setting.
>>>>>>>>>>>
>>>>>>>>>>> Now I am facing an issue with scheduling the tasks.
>>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>>
>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>
>>>>>>>>>>> The tasks always get scheduled on the same node. The resources from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>
>>>>>>>>>>> I just start the mesos slaves like below:
>>>>>>>>>>>
>>>>>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos --hostname=slave1
>>>>>>>>>>>
>>>>>>>>>>> If I submit the task using the above (mesos-execute) command from one of the slaves, it runs on that system.
>>>>>>>>>>>
>>>>>>>>>>> But when I submit the task from some different system, it uses just that system and queues the tasks; it does not run them on the other slaves. Sometimes I see the message "Failed to getgid: unknown user".
>>>>>>>>>>>
>>>>>>>>>>> Do I need to start some process to push the tasks onto all the slaves equally? Am I missing something here?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Pradeep
>>>>>>>>>>>
>>>>>>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <ondrej.smola@gmail.com> wrote:
>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>
>>>>>>>>>>>> the problem is with the IP your slave advertises - Mesos by default resolves your hostname - there are several solutions (let's say your node IP is 192.168.56.128)
>>>>>>>>>>>>
>>>>>>>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>>>>>>>> 2) set the mesos options - ip, hostname
>>>>>>>>>>>>
>>>>>>>>>>>> one way to do this is to create the files
>>>>>>>>>>>>
>>>>>>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>>>>>>
>>>>>>>>>>>> for more configuration options see
>>>>>>>>>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>>>>>>>>>
>>>>>>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <pradeepkiruvale@gmail.com>:
>>>>>>>>>>>>> Hi Guangya,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the reply. I found one interesting log message:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 7410 master.cpp:5977] Removed slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave registered at the same address
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mostly because of this issue, the systems/slave nodes are getting registered and de-registered to make room for the next node. I can even see this in the UI: for some time one node is added, and after some time it is replaced with the new slave node.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The above log is followed by the log messages below.
>>>>>>>>>>>>> I1002 10:01:12.753865  7416 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 104089ns
>>>>>>>>>>>>> I1002 10:01:12.753885  7416 replica.cpp:679] Persisted action at 384
>>>>>>>>>>>>> E1002 10:01:12.753891  7417 process.cpp:1912] Failed to shutdown socket with fd 15: Transport endpoint is not connected
>>>>>>>>>>>>> I1002 10:01:12.753988  7413 master.cpp:3930] Registered slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000]
>>>>>>>>>>>>> I1002 10:01:12.754065  7413 master.cpp:1080] Slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>>>>>>>> I1002 10:01:12.754072  7416 hierarchical.hpp:675] Added slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>>>>>>>>> I1002 10:01:12.754084  7413 master.cpp:2534] Disconnecting slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>>> E1002 10:01:12.754118  7417 process.cpp:1912] Failed to shutdown socket with fd 16: Transport endpoint is not connected
>>>>>>>>>>>>> I1002 10:01:12.754132  7413 master.cpp:2553] Deactivating slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>>> I1002 10:01:12.754237  7416 hierarchical.hpp:768] Slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>>>>>>> I1002 10:01:12.754240  7413 replica.cpp:658] Replica received learned notice for position 384
>>>>>>>>>>>>> I1002 10:01:12.754360  7413 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took 95171ns
>>>>>>>>>>>>> I1002 10:01:12.754395  7413 leveldb.cpp:401] Deleting ~2 keys from leveldb took 20333ns
>>>>>>>>>>>>> I1002 10:01:12.754406  7413 replica.cpp:679] Persisted action at 384
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gyliu513@gmail.com> wrote:
>>>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please check some of my questions inline.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Guangya
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 master and 3 slaves.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One slave runs on the master node itself and the other slaves run on different nodes. Here "node" means a physical box.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried running tasks by configuring a one-node cluster. Tested the task scheduling using mesos-execute; it works fine.
>>>>>>>>>>>>>>> When I configure a three-node cluster (1 master and 3 slaves) and try to see the resources on the master (in the GUI), only the master node's resources are visible. The other nodes' resources are not visible; sometimes they are visible but in a deactivated state.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you please append some logs from mesos-slave and mesos-master? There should be some logs in either the master or the slave telling you what is wrong.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Please let me know what could be the reason. All the nodes are in the same network.*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The tasks always get scheduled on the same node. The resources from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Based on your previous answer, there is only one node in your cluster; that's why the other nodes are not available. We first need to identify what is wrong with the other three nodes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Is it required to register the frameworks from every slave node on the master?*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It is not required.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *I have configured this cluster using the GitHub code.*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>>> Pradeep
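A closing note on the scheduler addresses seen throughout the logs above: in both the Oct 5 and Oct 7 master logs the framework registers at an address of the form scheduler-...@127.0.1.1:PORT, so the master most likely cannot connect back to the client host (hence the repeated "Failed to shutdown socket ... Transport endpoint is not connected" errors and the immediate failover/removal). The libprocess address fix Ondrej describes for the slaves should also apply to the host that runs mesos-execute. A minimal sketch, where 192.168.0.120 is only a placeholder for that client's real routable IP:

    # on the client node that runs mesos-execute
    export LIBPROCESS_IP=192.168.0.120
    ./src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test" \
        --command="/bin/sleep 10" --resources="cpus(*):1;mem(*):256"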
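One more note, on the cgroups error that kept one slave out of the cluster ("'freezer' is already attached to another hierarchy"): this usually means something else on that box (systemd, LXC, or an older mount) already owns the freezer hierarchy. A rough sketch of how to investigate, plus a possible fallback that avoids the Linux launcher entirely; the --launcher flag is a mesos-slave option, and the trade-off is that tasks launched via the POSIX launcher are not isolated with cgroups:

    # see what already has the freezer hierarchy mounted
    mount | grep freezer
    # fallback: run the agent with the POSIX launcher instead of the Linux launcher
    GLOG_v=1 ./bin/mesos-slave.sh --master=192.168.0.102:5050 --launcher=posix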