zookeeper-dev mailing list archives

From Flavio Junqueira <...@apache.org>
Subject Re: RC1 issues (was: Re: [VOTE] Apache ZooKeeper release 3.5.2-alpha candidate 1)
Date Mon, 04 Jul 2016 23:24:17 GMT
I forgot to fill in the name of the test giving the connection errors below: it is testFirstServerDown
in Zookeeper_simpleSystem (TestClient.cc).

-Flavio

> On 04 Jul 2016, at 23:53, Flavio Junqueira <fpj@apache.org> wrote:
> 
>> 
>> On 04 Jul 2016, at 22:01, Michael Han <hanm@cloudera.com> wrote:
>> 
>> Both Java and C unit tests coming with 3.5.2-alpha passed for me in 5 runs.
>> Are the failed tests deterministically reproducible?
> 
> They fail consistently for me. When I run xxx, I get this output in the logs, which is
> weird because it looks like the client tries 127.0.0.1:22181 only once and after that
> tries only 127.0.0.1:22182. That sounds wrong to me:
> 
> 2016-07-04 15:04:08,523:33750:ZOO_INFO@zookeeper_init_internal@1111: Initiating client connection, host=127.0.0.1:22182,127.0.0.1:22181 sessionTimeout=10000 watcher=0x447050 sessionId=0 sessionPasswd=<null> context=0x7fff8e504910 flags=0
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2016-07-04 15:04:09,524:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> 2016-07-04 15:04:09,524:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> <This line keeps repeating until the test times out>
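
One way to quantify the pattern in the log above is to count connection attempts per endpoint. The Python sketch below is illustrative only: `count_attempts` and its regular expression are hypothetical helpers, not part of ZooKeeper, and they assume every attempt is logged with a `Socket [host:port]` field as in the ZOO_ERROR lines above. The sample lines are abbreviated from that log.

```python
import re
from collections import Counter

# Assumption: every connection attempt is logged with "Socket [host:port]",
# as in the ZOO_ERROR lines quoted above.
SOCKET_RE = re.compile(r"Socket \[([^\]]+)\]")


def count_attempts(log_text):
    """Count logged connection attempts per endpoint (hypothetical helper)."""
    return Counter(SOCKET_RE.findall(log_text))


# Abbreviated sample in the same shape as the quoted log lines.
sample = "\n".join([
    "ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182] zk retcode=-4",
    "ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22181] zk retcode=-4",
    "ZOO_ERROR@handle_socket_error_msg@2350: Socket [127.0.0.1:22182] zk retcode=-4",
])

counts = count_attempts(sample)
for endpoint, n in sorted(counts.items()):
    print(endpoint, n)
```

A count that keeps growing for one endpoint while the other stays at 1 would match the behaviour described above.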
> 
> Also, if you check ZK-2463, it looks like the multi tests are failing silently. They
> are timing out, but the framework isn't picking it up. I haven't had a chance to look at these
> multi tests to determine whether it is a timing issue or something else.
> 
>> If not, it seems we
>> have more flaky tests related to threading / timing that need to be taken
>> care of, and they don't sound like blockers for the release to me.
>> 
> 
> From what I can tell, none of these issues are new, so I have no reason to suspect that
> an issue we resolved for 3.5.2 is introducing these problems. If we are to be strict, then
> we cannot release it, but I'd say we benefit from it still being alpha and should proceed. We are
> solving a number of issues that are good to have out. For 3.5.3, I think we really need to
> spend some time on the C client.
> 
> -Flavio 
> 
>> On Sun, Jul 3, 2016 at 9:48 PM, Rakesh Radhakrishnan <rakeshr@apache.org>
>> wrote:
>> 
>>>>> I'm suggesting as a blocker for 3.5.3, I think we should proceed with
>>> 3.5.2 as is and give some love to the C client in the next release.
>>> 
>>> Since the current release is alpha, I also feel it's OK to go ahead with RC1
>>> and address the C client issue in 3.5.3. That way we'll get more folks
>>> trying it out and eventually stabilize the 3.5 version. I will listen to
>>> others' opinions as well.
>>> 
>>> -Rakesh
>>> 
>>> On Mon, Jul 4, 2016 at 12:32 AM, Flavio Junqueira <fpj@apache.org> wrote:
>>> 
>>>> 
>>>>> On 03 Jul 2016, at 17:53, Chris Nauroth <cnauroth@hortonworks.com>
>>>> wrote:
>>>>> 
>>>>> For my part, I got a successful full test run from RC1 before starting the
>>>>> [VOTE].  The problem with the silent failure of multi tests could have
>>>>> snuck past me easily though.  (Flavio, thank you for filing
>>>>> ZOOKEEPER-2463.)  I'm curious to hear test results from others who are
>>>>> trying RC1.
>>>> 
>>>> The test failures seem to be related to test timing, not bugs, but I
>>>> haven't been able to confirm for the last two I mentioned. Granted that
>>>> timing is in some sense a bug, all I'm saying is that it doesn't seem to
>>>> indicate a regression or anything.
>>>> 
>>>>> 
>>>>> It looks like we also need an issue to track updating the copyright notice
>>>>> in the docs.  I don't believe this is an ASF compliance problem in the
>>>>> same way that an erroneous NOTICE file would be, so I propose that we
>>>>> address it in 3.5.3.
>>>> 
>>>> Agreed, we need an issue for that.
>>>> 
>>>>> 
>>>>> Flavio, you suggested filing a blocker for the ZooKeeperQuorumServer.cc
>>>>> failure.  Did you want that targeted to 3.5.2 or 3.5.3?
>>>>> 
>>>> 
>>>> I'm suggesting it as a blocker for 3.5.3. I think we should proceed with
>>>> 3.5.2 as is and give some love to the C client in the next release.
>>>> 
>>>>> Overall, how are people feeling about the RC1 [VOTE] at this point?  Is
>>>>> anyone considering a -1, or shall we proceed (keeping in mind it's an
>>>>> alpha) with the intent of fixing things in a more rapid 3.5.3 release
>>>>> cycle?
>>>> 
>>>> I'd say we proceed.
>>>> 
>>>> -Flavio
>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 7/3/16, 8:43 AM, "Flavio Junqueira" <fpj@apache.org> wrote:
>>>>> 
>>>>>> The issue with the TestReconfigServer test is that the client port is
>>>>>> still in use and we get a bind exception, which prevents the server from
>>>>>> starting. To verify this locally, I simply added some code to retry, and
>>>>>> it works fine with that fix. Going forward we need a better fix.
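
The retry idea mentioned above can be sketched generically. The change that was actually tried is not shown in this thread, so the Python below is only an illustration under stated assumptions: `retry` and `flaky_start` are hypothetical names, and `flaky_start` merely simulates a start-up that fails twice with "Address already in use" before succeeding.

```python
import time


def retry(action, attempts=5, delay=0.5, retry_on=(OSError,)):
    """Call action() until it succeeds; re-raise the error after the last attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(delay)  # give the previous owner time to release the port


calls = []


def flaky_start():
    """Hypothetical stand-in for a server start that hits a busy port twice."""
    calls.append(1)
    if len(calls) < 3:
        raise OSError(98, "Address already in use")
    return "started"


result = retry(flaky_start, attempts=5, delay=0.01)
print(result, "after", len(calls), "calls")
```

A fixed delay with a bounded number of attempts only works around the symptom, which is consistent with the remark above that a better fix is still needed.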
>>>>>> 
>>>>>> I haven't been able to figure out yet the issue with the
>>>>>> Zookeeper_simpleSystem tests.
>>>>>> 
>>>>>> I have also found something strange with the multi tests. I have created
>>>>>> ZK-2463 for this problem and made it a blocker for 3.5.3.
>>>>>> 
>>>>>> -Flavio
>>>>>> 
>>>>>>> On 03 Jul 2016, at 15:25, Flavio Junqueira <fpj@apache.org> wrote:
>>>>>>> 
>>>>>>> I have spun up a new Ubuntu VM to check the C failures. I get three
>>>>>>> failures with the new installation:
>>>>>>> 
>>>>>>> Zookeeper_simpleSystem::testFirstServerDown : assertion : elapsed 10911
>>>>>>> tests/TestClient.cc:411: Assertion: equality assertion failed
>>>>>>> [Expected: -101, Actual  : -4]
>>>>>>> tests/TestClient.cc:322: Assertion: assertion failed [Expression:
>>>>>>> ctx.waitForConnected(zk)]
>>>>>>> Failures !!!
>>>>>>> Run: 43   Failure total: 2   Failures: 2   Errors: 0
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> TestReconfigServer::testRemoveFollower/usr/bin/java
>>>>>>> ZooKeeper JMX enabled by default
>>>>>>> Using config: ./../../build/test/test-cppunit/conf/0.conf
>>>>>>> Starting zookeeper ... FAILED TO START
>>>>>>> zktest-mt: tests/ZooKeeperQuorumServer.cc:61: void
>>>>>>> ZooKeeperQuorumServer::start(): Assertion `system(command.c_str()) == 0'
>>>>>>> failed.
>>>>>>> /bin/bash: line 5: 47059 Aborted                 (core dumped)
>>>>>>> ZKROOT=./../.. CLASSPATH=$CLASSPATH:$CLOVER_HOME/lib/clover.jar
>>>>>>> ${dir}$tst
>>>>>>> 
>>>>>>> -Flavio
>>>>>>> 
>>>>>>> 
>>>>>>>> On 03 Jul 2016, at 15:19, Edward Ribeiro <edward.ribeiro@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hi Flavio,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sun, Jul 3, 2016 at 5:54 AM, Flavio Junqueira <fpj@apache.org> wrote:
>>>>>>>> Hey Eddie,
>>>>>>>> 
>>>>>>>> A few comments on your points:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> - the copyright notice is still dated "2008-2013". Is it worth
>>>>>>>>> updating to the current year?
>>>>>>>> 
>>>>>>>> Where are you seeing this? The NOTICE file is correct from what I can
>>>>>>>> see.
>>>>>>>> 
>>>>>>>> Oops, sorry. I was referring to the PDFs and HTMLs in the docs/
>>>>>>>> folder. Even after running "ant docs" the footnote has "2008-2013"
>>>>>>>> copyright. Images attached.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> - I consistently ran into a test error equal to the one at
>>>>>>>>> https://builds.apache.org/job/ZooKeeper-trunk/2982/console
>>>>>>>> 
>>>>>>>> I think this is ZK-2152, which Chris has moved to 3.5.3, so even
>>>>>>>> though it isn't ideal, it is expected.
>>>>>>>> 
>>>>>>>> Got it. :)
>>>>>>>>
>>>>>>>> 
>>>>>>>>> - Also this one:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> https://mail-archives.apache.org/mod_mbox/zookeeper-dev/201601.mbox/%3C1279938263.1283.1453526737790.JavaMail.jenkins@crius%3E
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> I don't know if there is a jira for this one. If not, better create
>>>>>>>> one and make it a blocker.
>>>>>>>> 
>>>>>>>> Okay, gonna look for and do this.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> - In fact, there were 14 failing tests total (I suspect all of them
>>>>>>>>> related to the C tests). Any ideas? A couple of flaky tests?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> In general, having a release with so many tests failing is bad. I
>>>>>>>> didn't get these test failures, so it would be great to report them or
>>>>>>>> make sure that there are jiras for them.
>>>>>>>> 
>>>>>>>> Right. I was only skeptical of my own tests because I ran the unit
>>>>>>>> tests on a relatively old Ubuntu version, even though it was Java 1.7.
>>>>>>>> So, I will run the tests on a newer Linux soon, just to make sure it
>>>>>>>> was not a false negative.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Test failures are possibly an indication that something is bad with
>>>>>>>> the RC, so I wouldn't have +1'ed it if I had observed all those. It might
>>>>>>>> be ok given that this is still labeled alpha.
>>>>>>>> 
>>>>>>>> Excuse me. I only +1'ed because I suspect the errors are restricted
>>>>>>>> to the C binding and my Ubuntu version, etc. But I should have
>>>>>>>> researched further before giving +1, nevertheless. Point taken. :)
>>>>>>>> 
>>>>>>>> Edward
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> Cheers
>> Michael.

