mesos-user mailing list archives

From Benjamin Mahler <benjamin.mah...@gmail.com>
Subject Re: Jenkins mesos plugin failing
Date Fri, 08 Nov 2013 01:23:03 GMT
From the master's perspective, the framework disconnected immediately after
registering.

You can bump up the logging on the jenkins scheduler by ensuring that
GLOG_v=3 is in your environment when our plugin is initialized.
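
As a rough illustration (not something the plugin ships), a small Python
wrapper could start Jenkins with the variable set. The function names and the
start command here are assumptions; substitute whatever actually launches
your Jenkins master:

```python
import os
import subprocess


def env_with_glog(verbosity=3, base=None):
    """Return a copy of the environment with GLOG_v set to `verbosity`."""
    env = dict(os.environ if base is None else base)
    env["GLOG_v"] = str(verbosity)
    return env


def launch(command, verbosity=3):
    """Run `command` (a list of arguments) with GLOG_v in its environment."""
    return subprocess.call(command, env=env_with_glog(verbosity))


# Hypothetical usage; the real start command depends on your installation:
# launch(["java", "-jar", "jenkins.war"])
```

The variable has to be present before the JVM that loads the plugin starts,
which is why the sketch sets it on the child process rather than afterwards.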

On Thu, Nov 7, 2013 at 3:17 PM, Whitney Sorenson <wsorenson@hubspot.com> wrote:

> Sure (https://github.com/jenkinsci/mesos-plugin/issues/4) but I'm
> actually running into another issue which I've seen before with other
> frameworks:
>
> I added the plugin to a separate Jenkins cluster and the framework doesn't
> seem to be able to maintain the connection successfully.
>
> The jenkins master log shows:
>
> Nov 7, 2013 10:12:38 PM org.jenkinsci.plugins.mesos.MesosCloud <init>
> INFO: Mesos master changed, restarting the scheduler
> Nov 7, 2013 10:12:38 PM org.jenkinsci.plugins.mesos.JenkinsScheduler <init>
> INFO: JenkinsScheduler instantiated with jenkins
> http://jenkins-master/jenkins/ and mesos mesos-master:5050
>
> With nothing else (no confirmation that the framework registered.)
>
> In the mesos UI, I see that the framework is constantly failing /
> registering. The logs show:
>
> I1107 22:53:06.791082 4283 master.cpp:1365] Framework failover timeout, removing framework 201310222354-1872141066-5050-4282-2992
> I1107 22:53:06.791760 4283 master.cpp:2022] Removing framework 201310222354-1872141066-5050-4282-2992
> I1107 22:53:06.792107 4283 hierarchical_allocator_process.hpp:352] Removed framework 201310222354-1872141066-5050-4282-2992
> I1107 22:53:07.788573 4286 master.cpp:695] Registering framework 201310222354-1872141066-5050-4282-2993 at scheduler(1)@10.46.101.33:58478
> I1107 22:53:07.788938 4286 hierarchical_allocator_process.hpp:321] Added framework 201310222354-1872141066-5050-4282-2993
> I1107 22:53:07.790592 4286 master.cpp:1448] Sending 1 offers to framework 201310222354-1872141066-5050-4282-2993
> I1107 22:53:07.790864 4284 master.cpp:489] Framework 201310222354-1872141066-5050-4282-2993 disconnected
> I1107 22:53:07.791007 4284 master.cpp:516] Giving framework 201310222354-1872141066-5050-4282-2993 0ns to failover
> I1107 22:53:07.791052 4285 hierarchical_allocator_process.hpp:397] Deactivated framework 201310222354-1872141066-5050-4282-2993
>
> This loop continues forever, happening several times per second.
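
To get a rough count of how often this register/disconnect cycle happens, the
master's INFO log can be scanned for the two messages involved. This is only a
sketch: it assumes glog-style lines with one entry per line, and the regular
expression and the name `count_events` are my own, not part of Mesos:

```python
import re
from collections import Counter

# glog prefix: severity letter + MMDD, time, thread id, then "file:line] message"
LINE = re.compile(
    r"^[IWEF]\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ "
    r"(?P<src>[\w.]+):\d+\] (?P<msg>.*)$"
)


def count_events(lines):
    """Count framework registrations and disconnects in master log lines."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip anything that is not a glog entry
        msg = match.group("msg")
        if msg.startswith("Registering framework"):
            counts["registered"] += 1
        elif msg.startswith("Framework") and msg.endswith("disconnected"):
            counts["disconnected"] += 1
    return counts


# Hypothetical usage; the log path depends on how the master was started:
# with open("mesos-master.INFO") as log:
#     print(count_events(log))
```

If the two counts grow together at several per second, that matches the loop
described above.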
>
> Any guidance on how to troubleshoot (I've already checked the network), or
> a way to increase the logging level on the master?
>
>
>
> On Thu, Nov 7, 2013 at 5:10 PM, Benjamin Mahler <benjamin.mahler@gmail.com> wrote:
>
>> We should fix that so that it reconnects with Mesos after a restart of
>> Jenkins!
>>
>> Can you file an issue for this?
>>
>>
>> On Thu, Nov 7, 2013 at 12:31 PM, Whitney Sorenson <wsorenson@hubspot.com> wrote:
>>
>>> I should also point out the scheduler didn't seem to survive a reboot of
>>> Jenkins - I had to delete the mesos cloud and reenter the parameters.
>>>
>>>
>>> On Thu, Nov 7, 2013 at 3:26 PM, Whitney Sorenson <wsorenson@hubspot.com> wrote:
>>>
>>>> Looks like we're using authentication on our slaves. So you either need
>>>> to pass
>>>>
>>>> -jnlpCredentials user:pass
>>>>
>>>> on the command line, or change around the permissions in Jenkins to
>>>> allow anonymous users to connect/run jobs.
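
A minimal sketch of the first option, assuming the slave is started with
`java -jar slave.jar` and the `-jnlpUrl` option. The helper `agent_command`
and the example URL are made up for illustration, and real credentials should
not be hard-coded like this:

```python
def agent_command(jnlp_url, credentials=None, jar="slave.jar"):
    """Build the argument list for launching a Jenkins JNLP slave.

    `credentials` is a "user:pass" string; when omitted, the slave
    connects anonymously and needs the Slave/Connect permission.
    """
    command = ["java", "-jar", jar, "-jnlpUrl", jnlp_url]
    if credentials:
        command += ["-jnlpCredentials", credentials]
    return command


# Hypothetical usage with placeholder values:
# agent_command("http://jenkins-master/jenkins/computer/NAME/slave-agent.jnlp",
#               credentials="user:pass")
```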
>>>>
>>>> I'm not sure if it would make sense or not to add the user/pass in the
>>>> Jenkins plugin configuration screen or if it should be fetched another way.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Nov 7, 2013 at 2:52 PM, Vinod Kone <vinodkone@gmail.com> wrote:
>>>>
>>>>> Great. Let us know once you figure it out. Maybe I can add a FAQ to
>>>>> the plugin's README to help others (or you can contribute too :)).
>>>>>
>>>>>
>>>>> On Thu, Nov 7, 2013 at 11:40 AM, Whitney Sorenson <wsorenson@hubspot.com> wrote:
>>>>>
>>>>>> I added the jenkins user on the slave - this was the missing piece.
>>>>>> I'll add this to my PR for the readme. Got much further now; now I'm
>>>>>> getting a 403 on the fetch:
>>>>>>
>>>>>> /jenkins/computer/mesos-jenkins-6f4719c8-1c61-4b28-b5ab-ba298e846840/slave-agent.jnlp: 403 Forbidden
>>>>>>     at hudson.remoting.Launcher.parseJnlpArguments(Launcher.java:261)
>>>>>>     at hudson.remoting.Launcher.run(Launcher.java:215)
>>>>>>
>>>>>> and corresponding log on jenkins master:
>>>>>>
>>>>>> Nov 7, 2013 2:38:39 PM winstone.Logger logInternal
>>>>>> INFO: While serving http://localhost:8080/jenkins/computer/mesos-jenkins-6f4719c8-1c61-4b28-b5ab-ba298e846840/slave-agent.jnlp:
>>>>>> hudson.security.AccessDeniedException2: anonymous is missing the Slave/Connect permission
>>>>>>
>>>>>> Going to look into what this means.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 7, 2013 at 2:21 PM, Vinod Kone <vinodkone@gmail.com> wrote:
>>>>>>
>>>>>>> I looked at the code and it looks like there are a few places where the
>>>>>>> executor might fail before it fetches the URI. Most of them have to do
>>>>>>> with incorrect permissions. The code was written to have any errors
>>>>>>> reported in the slave log, the console, or the executor logs (there
>>>>>>> might be a bug here if we are in fact swallowing errors). IIUC, the
>>>>>>> executor log directory is empty in your case, which suggests the
>>>>>>> executor died before it could even create the "stdout" or "stderr"
>>>>>>> files in its sandbox (is this true?).
>>>>>>>
>>>>>>> Couple of questions:
>>>>>>>
>>>>>>> What user is the Jenkins master running as? Is that user known to the
>>>>>>> host on which the mesos slave is running?
>>>>>>>
>>>>>>> How are you starting the mesos slave (e.g., cmd line flags)?
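
One quick way to check the first question on a slave host is to look the user
up in the local account database. A minimal sketch, assuming a Unix host;
`user_exists` and the user name `jenkins` are illustrative only:

```python
import pwd


def user_exists(name):
    """Return True if `name` is a known account on this host."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        # getpwnam raises KeyError when no such entry exists
        return False


# Hypothetical usage on the slave host:
# print(user_exists("jenkins"))
```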
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 7, 2013 at 11:00 AM, Whitney Sorenson <wsorenson@hubspot.com> wrote:
>>>>>>>
>>>>>>>> The gist was compiled from that log. Here is the complete log from
>>>>>>>> toggling the jenkins plugin on / off (you see the ping statements
>>>>>>>> in between):
>>>>>>>>
>>>>>>>> https://gist.github.com/wsorenson/8bf64e44fd42da354fa0
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 7, 2013 at 1:57 PM, Vinod Kone <vinodkone@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> What does mesos-slave.err say?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 7, 2013 at 10:49 AM, Whitney Sorenson <wsorenson@hubspot.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Vinod,
>>>>>>>>>>
>>>>>>>>>> It's 0.14.0-rc4 in both.
>>>>>>>>>>
>>>>>>>>>> I believe we have logging working:
>>>>>>>>>>
>>>>>>>>>> -rw-r--r-- 1 root root         0 Oct 22 23:48 mesos-slave.out
>>>>>>>>>> lrwxrwxrwx 1 root root        63 Oct 22 23:48 mesos-slave.INFO -> mesos-slave.carousel.invalid-user.log.INFO.20131022-234823.5797
>>>>>>>>>> lrwxrwxrwx 1 root root        66 Oct 22 23:49 mesos-slave.WARNING -> mesos-slave.carousel.invalid-user.log.WARNING.20131022-234954.5797
>>>>>>>>>> drwxr-xr-x 2 root root      4096 Oct 22 23:49 .
>>>>>>>>>> -rw-rw-r-- 1 root root      4827 Nov  1 20:34 mesos-slave.carousel.invalid-user.log.WARNING.20131022-234954.5797
>>>>>>>>>> -rw-rw-r-- 1 root root  10408140 Nov  7 18:44 mesos-slave.carousel.invalid-user.log.INFO.20131022-234823.5797
>>>>>>>>>> -rw-r--r-- 1 root root  53759705 Nov  7 18:45 mesos-slave.err
>>>>>>>>>>
>>>>>>>>>> Is there something else to check? Is it possible the executor is
>>>>>>>>>> failing before it even attempts to fetch URIs?
>>>>>>>>>>
>>>>>>>>>> Ray - Thanks - yeah I found the jenkins logs. I was able to wget
>>>>>>>>>> the slave.jar, and even run it. The mesos-jenkins slaves are dead now,
>>>>>>>>>> so I can't connect to their slave-agent - but the jar does run. Not
>>>>>>>>>> sure if the window for trying to connect to one of the mesos launched
>>>>>>>>>> slaves is long enough to try before it is terminated due to failures.
>>>>>>>>>> Interestingly, when I try to connect to one of the existing slaves I
>>>>>>>>>> get a 403.
>>>>>>>>>>
>>>>>>>>>> -Whitney
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 7, 2013 at 1:34 PM, Vinod Kone <vinodkone@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Whitney,
>>>>>>>>>>>
>>>>>>>>>>> What version of mesos are you using (both in the cluster and the
>>>>>>>>>>> plugin)?
>>>>>>>>>>>
>>>>>>>>>>> The slave should print stuff to the console when it is launching an
>>>>>>>>>>> executor (e.g., "Fetching resources..."). I don't see that in the
>>>>>>>>>>> gist you pasted. Are you capturing stdout/stderr of the slave?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 7, 2013 at 10:30 AM, Whitney Sorenson <wsorenson@hubspot.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Ray.
>>>>>>>>>>>>
>>>>>>>>>>>> I have a very similar issue (empty executor directories) - but
>>>>>>>>>>>> don't have any issues curling the slave.jar URI - and I don't have
>>>>>>>>>>>> any existing JNLP process running. I don't have a jenkins user - is
>>>>>>>>>>>> that the only setup you did on the slave?
>>>>>>>>>>>>
>>>>>>>>>>>> -Whitney
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Nov 7, 2013 at 1:16 PM, Ray Rodriguez <rayrod2030@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Whitney, I would have a look at this github issue where I
>>>>>>>>>>>>> work through some of my jenkins mesos-plugin issues with Vinod.
>>>>>>>>>>>>> Might be some of the same issues you are seeing.
>>>>>>>>>>>>> https://github.com/jenkinsci/mesos-plugin/issues/2
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ray
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Nov 7, 2013 at 1:07 PM, Whitney Sorenson <wsorenson@hubspot.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am trying to get the Jenkins Mesos plugin functioning. I
>>>>>>>>>>>>>> was able to get it installed on our Jenkins master.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, it's unclear if there are any required steps for
>>>>>>>>>>>>>> setting up the slaves. When a framework task is launched, it fails
>>>>>>>>>>>>>> instantly and there are no logs in the runs folder.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here's a gist with relevant logs from the slave:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://gist.github.com/wsorenson/b3562c3e4a8992f9a46f/raw/ea5821c442d826456291330452208d8d7ac8418f/failing+jenkins+logs
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any help on how to debug? At first, I thought maybe we needed
>>>>>>>>>>>>>> slave.jar or something but it looks like it's trying to fetch that
>>>>>>>>>>>>>> from the master using the URIs. To clarify, I have done no special
>>>>>>>>>>>>>> jenkins related setup (as per readme.md) on any of the slaves.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Whitney
>>>>>>>>>>>>>>
