zookeeper-dev mailing list archives

From Flavio Junqueira <...@apache.org>
Subject Re: RC1 issues (was: Re: [VOTE] Apache ZooKeeper release 3.5.2-alpha candidate 1)
Date Mon, 04 Jul 2016 22:53:25 GMT

> On 04 Jul 2016, at 22:01, Michael Han <hanm@cloudera.com> wrote:
> 
> Both Java and C unit tests coming with 3.5.2-alpha passed for me in 5 runs.
> Are the failed tests deterministically reproducible?

They fail consistently for me. When I run xxx, I get the output below in the logs. It looks
odd: the client tries 127.0.0.1:22181 only once and after that tries only 127.0.0.1:22182,
which seems wrong to me:

2016-07-04 15:04:08,523:33750:ZOO_INFO@zookeeper_init_internal@1111: Initiating client connection,
host=127.0.0.1:22182,127.0.0.1:22181 sessionTimeout=10000 watcher=0x447050 sessionId=0 sessionPasswd=<null>
context=0x7fff8e504910 flags=0
2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182]
zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22181]
zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182]
zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182]
zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182]
zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182]
zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182]
zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182]
zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182]
zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2016-07-04 15:04:09,524:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182]
zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2016-07-04 15:04:09,524:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182]
zk retcode=-4, errno=111(Connection refused): server refused to accept the client
<This line keeps repeating until the test times out>
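
To make the pattern in the excerpt easier to see, here is a minimal sketch that compares the socket addresses from the 11 error lines above with what a plain in-order rotation over the connect string would produce. It assumes the client is expected to cycle through the host list in order; that is my expectation, not something verified against the C client source. The helper names (round_robin_attempts, first_divergence) are made up for this illustration and are not part of any ZooKeeper API.

```python
from collections import Counter
from itertools import cycle, islice


def round_robin_attempts(hosts, attempts):
    """Hosts a client would try if it simply cycled through the list in order.

    This models the assumed expectation only, not the C client's actual logic.
    """
    return list(islice(cycle(hosts), attempts))


def first_divergence(expected, observed):
    """Index of the first attempt where the two sequences differ, or None."""
    for i, (e, o) in enumerate(zip(expected, observed)):
        if e != o:
            return i
    return None


# Connect string order as printed by zookeeper_init_internal in the log.
hosts = ["127.0.0.1:22182", "127.0.0.1:22181"]

# Socket addresses of the 11 error lines in the excerpt, in order.
observed = ["127.0.0.1:22182", "127.0.0.1:22181"] + ["127.0.0.1:22182"] * 9

expected = round_robin_attempts(hosts, len(observed))

print(Counter(observed))
print(first_divergence(expected, observed))
```

In the excerpt, 127.0.0.1:22181 shows up once and 127.0.0.1:22182 ten times, and the observed sequence departs from a strict alternation at the fourth attempt (index 3).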

Also, if you check ZK-2463, it looks like the multi tests are failing silently. They are timing
out, but the framework isn't picking it up. I haven't had a chance to look at these multi
tests to determine whether it is a timing issue or something else.

> If not, it seems we
> have more flaky tests related to threading / timing that need to be taken
> care of, and they don't sound like blockers for the release to me.
> 

From what I can tell, none of these issues are new, so I have no reason to suspect that an
issue we resolved for 3.5.2 is introducing these problems. If we are to be strict, then we
cannot release it, but I'd say we benefit from it still being alpha and should proceed. We are
solving a number of issues that are good to have out. For 3.5.3, I think we really need to
spend some time on the C client.

-Flavio 

> On Sun, Jul 3, 2016 at 9:48 PM, Rakesh Radhakrishnan <rakeshr@apache.org>
> wrote:
> 
>>>> I'm suggesting as a blocker for 3.5.3, I think we should proceed with
>> 3.5.2 as is and give some love to the C client in the next release.
>> 
>> Since the current release is alpha I also feel it's OK to go ahead with RC1
>> and address the C client issue in 3.5.3. That way we'll get more folks
>> trying it out and stabilize the 3.5 version eventually. Probably will listen to
>> others' opinions as well.
>> 
>> -Rakesh
>> 
>> On Mon, Jul 4, 2016 at 12:32 AM, Flavio Junqueira <fpj@apache.org> wrote:
>> 
>>> 
>>>> On 03 Jul 2016, at 17:53, Chris Nauroth <cnauroth@hortonworks.com> wrote:
>>>> 
>>>> For my part, I got a successful full test run from RC1 before starting the
>>>> [VOTE].  The problem with the silent failure of multi tests could have
>>>> snuck past me easily though.  (Flavio, thank you for filing
>>>> ZOOKEEPER-2463.)  I'm curious to hear test results from others who are
>>>> trying RC1.
>>> 
>>> The test failures seem to be related to test timing, not bugs, but I
>>> haven't been able to confirm for the last two I mentioned. Granted that
>>> timing is in some sense a bug, all I'm saying is that it doesn't seem to
>>> indicate a regression or anything.
>>> 
>>>> 
>>>> It looks like we also need an issue to track updating the copyright notice
>>>> in the docs.  I don't believe this is an ASF compliance problem in the
>>>> same way that an erroneous NOTICE file would be, so I propose that we
>>>> address it in 3.5.3.
>>> 
>>> Agreed, we need an issue for that.
>>> 
>>>> 
>>>> Flavio, you suggested filing a blocker for the ZooKeeperQuorumServer.cc
>>>> failure.  Did you want that targeted to 3.5.2 or 3.5.3?
>>>> 
>>> 
>>> I'm suggesting it as a blocker for 3.5.3. I think we should proceed with
>>> 3.5.2 as is and give some love to the C client in the next release.
>>> 
>>>> Overall, how are people feeling about the RC1 [VOTE] at this point?  Is
>>>> anyone considering a -1, or shall we proceed (keeping in mind it's an
>>>> alpha) with the intent of fixing things in a more rapid 3.5.3 release
>>>> cycle?
>>> 
>>> I'd say we proceed.
>>> 
>>> -Flavio
>>> 
>>>> 
>>>> 
>>>> 
>>>> On 7/3/16, 8:43 AM, "Flavio Junqueira" <fpj@apache.org> wrote:
>>>> 
>>>>> The issue with the TestReconfigServer test is that the client port is
>>>>> still used and we get a bind exception, which prevents the server from
>>>>> starting. To verify this locally, I simply added some code to retry and
>>>>> it works fine with that fix. Going forward we need a better fix.
>>>>> 
>>>>> I haven't been able to figure out yet the issue with the
>>>>> Zookeeper_simpleSystem tests.
>>>>> 
>>>>> I have also found something strange with the multi tests. I have created
>>>>> ZK-2463 for this problem and made it a blocker for 3.5.3.
>>>>> 
>>>>> -Flavio
>>>>> 
>>>>>> On 03 Jul 2016, at 15:25, Flavio Junqueira <fpj@apache.org> wrote:
>>>>>> 
>>>>>> I have spun a new ubuntu VM to check the C failures. I get three
>>>>>> failures with the new installation:
>>>>>> 
>>>>>> Zookeeper_simpleSystem::testFirstServerDown : assertion : elapsed 10911
>>>>>> tests/TestClient.cc:411: Assertion: equality assertion failed
>>>>>> [Expected: -101, Actual  : -4]
>>>>>> tests/TestClient.cc:322: Assertion: assertion failed [Expression:
>>>>>> ctx.waitForConnected(zk)]
>>>>>> Failures !!!
>>>>>> Run: 43   Failure total: 2   Failures: 2   Errors: 0
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> TestReconfigServer::testRemoveFollower/usr/bin/java
>>>>>> ZooKeeper JMX enabled by default
>>>>>> Using config: ./../../build/test/test-cppunit/conf/0.conf
>>>>>> Starting zookeeper ... FAILED TO START
>>>>>> zktest-mt: tests/ZooKeeperQuorumServer.cc:61: void
>>>>>> ZooKeeperQuorumServer::start(): Assertion `system(command.c_str()) == 0'
>>>>>> failed.
>>>>>> /bin/bash: line 5: 47059 Aborted                 (core dumped)
>>>>>> ZKROOT=./../.. CLASSPATH=$CLASSPATH:$CLOVER_HOME/lib/clover.jar
>>>>>> ${dir}$tst
>>>>>> 
>>>>>> -Flavio
>>>>>> 
>>>>>> 
>>>>>>> On 03 Jul 2016, at 15:19, Edward Ribeiro <edward.ribeiro@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> Hi Flavio,
>>>>>>> 
>>>>>>> 
>>>>>>> On Sun, Jul 3, 2016 at 5:54 AM, Flavio Junqueira <fpj@apache.org
>>>>>>> <mailto:fpj@apache.org>> wrote:
>>>>>>> Hey Eddie,
>>>>>>> 
>>>>>>> A few comments on your points:
>>>>>>> 
>>>>>>>> 
>>>>>>>> - the copyright notice is still dated "2008-2013". Is it worth
>>>>>>>> updating to the current year?
>>>>>>> 
>>>>>>> Where are you seeing this? The NOTICE file is correct from what I can
>>>>>>> see.
>>>>>>> 
>>>>>>> Oops, sorry. I was referring to the PDFs and HTMLs in the docs/
>>>>>>> folder. Even after running "ant docs" the footnote has "2008-2013"
>>>>>>> copyright. Images attached.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> - I consistently ran into a test error identical to the one at
>>>>>>>> https://builds.apache.org/job/ZooKeeper-trunk/2982/console
>>>>>>> 
>>>>>>> I think this is ZK-2152, which Chris has moved to 3.5.3, so even
>>>>>>> though it isn't ideal, it is expected.
>>>>>>> 
>>>>>>> Got it. :)
>>>>>>>
>>>>>>> 
>>>>>>>> - Also this one:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> https://mail-archives.apache.org/mod_mbox/zookeeper-dev/201601.mbox/%3C1279938263.1283.1453526737790.JavaMail.jenkins@crius%3E
>>>>>>>> 
>>>>>>> 
>>>>>>> I don't know if there is a jira for this one. If not, better create
>>>>>>> one and make it a blocker.
>>>>>>> 
>>>>>>> Okay, gonna look for and do this.
>>>>>>> 
>>>>>>> 
>>>>>>>> - In fact, there were 14 failing tests total (I suspect all of them
>>>>>>>> related to the C tests). Any ideas? A couple of flaky tests?
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> In general, having a release with so many tests failing is bad. I
>>>>>>> didn't get these test failures, so it would be great to report them or
>>>>>>> make sure that there are jiras for them.
>>>>>>> 
>>>>>>> Right. I was only skeptical of my own tests because I ran the unit
>>>>>>> tests on a relatively old Ubuntu version, even though it was Java 1.7.
>>>>>>> So, I will run the tests on a newer Linux soon just to make sure it
>>>>>>> was not a false negative.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Test failures are possibly an indication that something is bad with
>>>>>>> the RC, so I wouldn't have +1'd it if I had observed all those. It might
>>>>>>> be ok given that this is still labeled alpha.
>>>>>>> 
>>>>>>> Excuse me. I only +1'ed because I suspect the errors are restricted
>>>>>>> to the C binding and my Ubuntu version, etc. But I should have
>>>>>>> researched further before giving +1, nevertheless. Point taken. :)
>>>>>>> 
>>>>>>> Edward
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> Cheers
> Michael.

