cloudstack-dev mailing list archives

From Rohit Yadav <rohit.ya...@shapeblue.com>
Subject RE: Test failure on master?
Date Thu, 12 May 2016 06:04:53 GMT
The issue is not specific to the change but to how the tests are run.
On OSX, when a process tries to listen on a port, the firewall (enabled by default in most
cases) can block the process from running.
The other issue is that the test relies on the wall clock, which is unreliable if the test/CI env is virtualized,
and the test can fail if thread scheduling did not allocate enough CPU bursts.
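One common way to sidestep the OSX firewall issue described above is to bind test listeners to the loopback interface only, since loopback-only sockets are not externally reachable and the macOS application firewall has nothing to block. A minimal sketch (the class and method names are illustrative, not the actual NioTest code):

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class LoopbackListener {
    // Opens a listener bound to the loopback interface on an ephemeral port
    // and returns "host:port". Because the socket is loopback-only, no
    // firewall prompt or block is triggered on OSX.
    static String bindLoopback() throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 50, loopback)) {
            return server.getInetAddress().getHostAddress()
                    + ":" + server.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("bound " + bindLoopback());
    }
}
```

Using port 0 also lets the OS pick a free ephemeral port, which avoids collisions when tests run in parallel on CI.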

The new PR reduces the test count to 2. This ensures that only 2 malicious clients are
created, which will theoretically block for 30s max (15s each; the Link class has a hard
limit that fails SSL handshakes after 15 seconds).
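The worst-case arithmetic above (2 malicious clients x 15s handshake limit = 30s) can be expressed as a small budget calculation; this is an illustrative sketch, not the actual Link class code:

```java
import java.time.Duration;

public class HandshakeBudget {
    // Per-handshake hard limit, mirroring the 15-second cap described above.
    static final Duration PER_HANDSHAKE_LIMIT = Duration.ofSeconds(15);

    // Worst case assumes malicious clients are handled sequentially and each
    // one stalls its handshake until the hard limit expires.
    static Duration worstCaseBlocking(int maliciousClients) {
        return PER_HANDSHAKE_LIMIT.multipliedBy(maliciousClients);
    }

    public static void main(String[] args) {
        // With the test count reduced to 2, the worst case is 30 seconds,
        // comfortably inside the test's 60-second timeout.
        System.out.println(worstCaseBlocking(2).getSeconds() + "s");
    }
}
```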

In case of Simon's environment, we'll need to verify the root cause -- whether the patch was
successfully applied (there were two PRs) and applied to both the mgmt. server and the KVM agent.

Finally, I've not seen any NioTest failures on Jenkins or Travis, but let me know if you've
seen something or can point to a failing test.

Regards,

Rohit Yadav

rohit.yadav@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London WC2N 4HS, UK
@shapeblue

-----Original Message-----
From: williamstevens@gmail.com [mailto:williamstevens@gmail.com] On Behalf Of Will Stevens
Sent: Wednesday, May 11, 2016 7:48 PM
To: dev@cloudstack.apache.org; Simon Weller <sweller@ena.com>
Subject: Re: Test failure on master?

Rohit, I have seen quite a few issues with this feature so far.  The change you made in #1538
does not change the actual code at all; it just reduces the number of tests, so you are less
likely to run into the problem, but the problem still exists.

I am CCing in Simon Weller as well.  I was talking to him this morning and he had this to
say (unprompted).

> Will, We're still seeing odd issues with that NIO SSL concurrency patch
> (1493), even after pulling in the additional PR 1534. The latest 
> problem we've seen is 100% cpu on the agents for no apparent reason. I 
> reverted both patches from our QA lab this morning and the problem has gone away.


> I pulled it into a second lab where we have haproxy setup to load balance
> and the same behaviour occurs


> top - 08:18:15 up 1 day, 17:08,  5 users,  load average: 1.92, 2.22, 2.09
> Tasks: 223 total,   1 running, 222 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 22.2 us, 11.9 sy,  0.0 ni, 65.8 id,  0.0 wa,  0.0 hi,  0.1 
> si,
>  0.0 st
> KiB Mem : 32673608 total, 28312176 free,  3512104 used,   849328 buff/cache
> KiB Swap:  4194300 total,  4194300 free,        0 used. 28757568 avail Mem
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
> COMMAND
>
> 17985 root      20   0 6937720 162816  22196 S 100.3  0.5   3:24.84
> /usr/lib/jvm/jre/bin/java -Xms256m -Xmx2048m -cp 
> /usr/share/cloudstack-agent/lib/activatio+
> 15587 root      20   0 1733288 375976  12164 S 100.0  1.2  10:42.36
> /usr/libexec/qemu-kvm -name v-46-VM -S -machine 
> pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 1+
>  4480 root      20   0  909604 305292  12264 S   0.7  0.9   1:10.21
> /usr/libexec/qemu-kvm -name r-44-VM -S -machine 
> pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 2+
>  5188 root      20   0  957548 323420  12216 S   0.7  1.0   1:07.35
> /usr/libexec/qemu-kvm -name r-45-VM -S -machine 
> pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 2+
> 18336 root      20   0  157840   2392   1556 R   0.7  0.0   0:00.14 top
>
>
> 19023 root      20   0 1002156 449720  12372 S   0.7  1.4  10:57.69
> /usr/libexec/qemu-kvm -name r-32-VM -S -machine 
> pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 2+


I am considering reverting this feature (both PRs) until we can understand what is causing
this and can stabilize this code so it does not cause us problems.  With this type of behavior,
I am not confident running this code in production right now...

*Will STEVENS*
Lead Developer

*CloudOps* *| *Cloud Solutions Experts
420 rue Guy *|* Montreal *|* Quebec *|* H3J 1S6 w cloudops.com *|* tw @CloudOps_

On Wed, May 11, 2016 at 5:36 AM, Rohit Yadav <rohit.yadav@shapeblue.com>
wrote:

> Please follow up on PR #1538 and comment if that fixes the issue on OSX.
>
> Regards.
>
> Regards,
>
> Rohit Yadav
>
> rohit.yadav@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London WC2N 4HS, UK @shapeblue
>
> -----Original Message-----
> From: Rohit Yadav [mailto:rohit.yadav@shapeblue.com]
> Sent: Wednesday, May 11, 2016 2:49 PM
> To: dev@cloudstack.apache.org
> Subject: RE: Test failure on master?
>
> I don't have OSX, but it seems to be working on Travis and in Linux
> environments in general.
> I'll send a PR that relaxes the malicious client attacks, and ask you to
> review it in your env -- Koushik and Mike.
>
> Regards,
>
> Rohit Yadav
>
> rohit.yadav@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London WC2N 4HS, UK @shapeblue
>
> -----Original Message-----
> From: Koushik Das [mailto:koushik.das@accelerite.com]
> Sent: Wednesday, May 11, 2016 12:22 PM
> To: dev@cloudstack.apache.org
> Subject: Re: Test failure on master?
>
> I am also seeing the same failure happening randomly. OS X El Capitan 
> 10.11.4.
>
> Results :
>
> Tests in error:
>   NioTest.testConnection:152 > TestTimedOut test timed out after 60000 
> milliseco...
>
> Tests run: 200, Failures: 0, Errors: 1, Skipped: 13
>
>
> ________________________________________
> From: Tutkowski, Mike <Mike.Tutkowski@netapp.com>
> Sent: Tuesday, May 10, 2016 6:31:23 PM
> To: dev@cloudstack.apache.org
> Subject: Re: Test failure on master?
>
> Oh, and it's the OS of my MacBook Pro.
>
> > On May 10, 2016, at 6:59 AM, Tutkowski, Mike 
> > <Mike.Tutkowski@netapp.com>
> wrote:
> >
> > Hi,
> >
> > The environment is Mac OS X El Capitan 10.11.4.
> >
> > Thanks!
> > Mike
> >
> >> On May 10, 2016, at 5:51 AM, Will Stevens <wstevens@cloudops.com>
> wrote:
> >>
> >> I think I can verify that this is still happening on master for him 
> >> because you changed the timeout (and the number of tests run, etc) 
> >> when you pushed the fix in #1534.  So by looking at the timeout of 
> >> 60000, we can verify that it is the latest code from master being run.
> >>
> >> I do think we need to revisit this to make sure we don't have 
> >> intermittent issues with this test.
> >>
> >> Thx guys...
> >>
> >> *Will STEVENS*
> >> Lead Developer
> >>
> >> *CloudOps* *| *Cloud Solutions Experts
> >> 420 rue Guy *|* Montreal *|* Quebec *|* H3J 1S6 w cloudops.com *|* 
> >> tw @CloudOps_
> >>
> >> On Tue, May 10, 2016 at 7:41 AM, Rohit Yadav 
> >> <rohit.yadav@shapeblue.com>
> >> wrote:
> >>
> >>> Mike,
> >>>
> >>> Can you comment if you're using the latest master? Can you also
> >>> share the environment where you're running this (in a VM, automated
> >>> by Jenkins, Java version etc)?
> >>>
> >>> Will - I think the issue should be fixed on latest master, but if 
> >>> Mike and others are getting failures I can further relax the test.
> >>> In virtualized environments, there may be threading/scheduling issues.
> >>>
> >>> Regards,
> >>> Rohit Yadav
> >>>
> >>>
> >>> Regards,
> >>>
> >>> Rohit Yadav
> >>>
> >>> rohit.yadav@shapeblue.com
> >>> www.shapeblue.com
> >>> 53 Chandos Place, Covent Garden, London WC2N 4HS, UK @shapeblue
> >>>
> >>> On May 10 2016, at 3:20 am, Will Stevens <wstevens@cloudops.com> wrote:
> >>>
> >>> Rohit, can you look into this.
> >>>
> >>> It was first introduced in:
> >>> https://github.com/apache/cloudstack/pull/1493
> >>>
> >>> I thought the problem was fixed with this:
> >>> https://github.com/apache/cloudstack/pull/1534
> >>>
> >>> Apparently we still have a problem. This is intermittently 
> >>> emitting false negatives from what I can tell...
> >>>
> >>> *Will STEVENS*
> >>> Lead Developer
> >>>
> >>> *CloudOps* *| *Cloud Solutions Experts
> >>> 420 rue Guy *|* Montreal *|* Quebec *|* H3J 1S6 w cloudops.com *|* 
> >>> tw @CloudOps_
> >>>
> >>> On Mon, May 9, 2016 at 5:34 PM, Tutkowski, Mike 
> >>> <Mike.Tutkowski@netapp.com
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>>
> >>>> I've seen this a couple times today.
> >>>>
> >>>>
> >>>> Is this a known issue?
> >>>>
> >>>>
> >>>> Results :
> >>>>
> >>>>
> >>>> Tests in error:
> >>>>
> >>>> NioTest.testConnection:152 > TestTimedOut test timed out after
> >>>> 60000 milliseco...
> >>>>
> >>>>
> >>>> Tests run: 200, Failures: 0, Errors: 1, Skipped: 13
> >>>>
> >>>>
> >>>> [INFO]
> >>>> -----------------------------------------------------------------
> >>>> --
> >>>> -----
> >>>>
> >>>> [INFO] Reactor Summary:
> >>>>
> >>>> [INFO]
> >>>>
> >>>> [INFO] Apache CloudStack Developer Tools - Checkstyle 
> >>>> Configuration SUCCESS [ 1.259 s]
> >>>>
> >>>> [INFO] Apache CloudStack .................................. 
> >>>> SUCCESS [
> >>>> 1.858 s]
> >>>>
> >>>> [INFO] Apache CloudStack Maven Conventions Parent ......... 
> >>>> SUCCESS [
> >>>> 1.528 s]
> >>>>
> >>>> [INFO] Apache CloudStack Framework - Managed Context ...... 
> >>>> SUCCESS [
> >>>> 4.882 s]
> >>>>
> >>>> [INFO] Apache CloudStack Utils ............................ 
> >>>> FAILURE
> >>> [01:20
> >>>> min]
> >>>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Mike
> >>>
>
>
>
> DISCLAIMER
> ==========
> This e-mail may contain privileged and confidential information which 
> is the property of Accelerite, a Persistent Systems business. It is 
> intended only for the use of the individual or entity to which it is 
> addressed. If you are not the intended recipient, you are not 
> authorized to read, retain, copy, print, distribute or use this 
> message. If you have received this communication in error, please 
> notify the sender and delete all copies of this message. Accelerite, a 
> Persistent Systems business does not accept any liability for virus infected mails.
>