cassandra-commits mailing list archives

From "Stefania (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10938) test_bulk_round_trip_blogposts is failing occasionally
Date Fri, 22 Jan 2016 04:59:39 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15111916#comment-15111916
] 

Stefania commented on CASSANDRA-10938:
--------------------------------------

I've looked more closely at the failures on Jenkins since CASSANDRA-9303 was committed and
noted that:

* They are now rarer and occur mostly on 2.1.
* The problem is only with COPY TO, not COPY FROM, so reducing the ingest rate would not help.

I've set up an AWS box with the same specs as the ones used by Jenkins (m3.2xlarge) and
run {{test_bulk_round_trip_blogposts}} 50 times with no failures. There must be something
else on the Jenkins boxes that causes connections to be rejected, but I could not work out what.


So I decided to simulate a failed connection by setting {{native_transport_max_concurrent_connections}}
to limit the number of connections accepted by hosts. It doesn't tell us what's happening
on Jenkins but at least it allows us to test COPY TO in the face of failed connections, which
is a good thing anyway and should hopefully ensure that the Jenkins failures disappear. Note
that just stopping replicas would not have easily allowed testing this because the code selects
only replicas that are up. I've also increased the replication factor from 1 to 3 and the
nodes from 3 to 5 for {{test_bulk_round_trip_blogposts}} to give it more resilience.

I've changed the COPY TO connection logic to try multiple replicas one by one in case of failure.
Previously we were giving multiple replicas to the load balancing policy, but the only contact
point was the chosen replica. More importantly, if all replicas fail, we no longer kill
the worker process, which would halt the entire export. Instead we return an error for that token
range, which means that it is tried again later, up to MAXATTEMPTS times.
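
The replica fallback and per-token retry described above can be sketched roughly as follows. This is only an illustration and not the code in the patch: {{export_range}}, {{export_all}}, {{AllReplicasFailed}}, the {{connect}} callable and the default of 5 attempts are all hypothetical names and values chosen for the example.

```python
from collections import deque


class AllReplicasFailed(Exception):
    """Raised when no replica of a token range accepted a connection."""


def export_range(token_range, replicas, connect):
    """Try each replica in turn and fetch from the first one that connects.

    `connect` is a hypothetical callable: it takes a host and returns a
    fetch function, or raises ConnectionError if the host rejects us.
    """
    errors = []
    for host in replicas:
        try:
            fetch = connect(host)
        except ConnectionError as exc:
            errors.append("%s: %s" % (host, exc))
            continue  # fall through to the next replica
        return fetch(token_range)
    raise AllReplicasFailed("; ".join(errors))


def export_all(ranges, connect, max_attempts=5):
    """Export every (token_range, replicas) pair, retrying failed ranges later.

    A failed range is re-queued behind the remaining work instead of
    aborting the whole export; it is given up after `max_attempts` tries.
    """
    pending = deque((item, 0) for item in ranges)
    results, failed = {}, {}
    while pending:
        (token_range, replicas), attempts = pending.popleft()
        try:
            results[token_range] = export_range(token_range, replicas, connect)
        except AllReplicasFailed as exc:
            if attempts + 1 < max_attempts:
                pending.append(((token_range, replicas), attempts + 1))
            else:
                failed[token_range] = str(exc)
    return results, failed
```

Re-queuing a failed range at the back of the queue gives the hosts some time to free up connections before that range is attempted again.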

New test code is [here|https://github.com/stef1927/cassandra-dtest/commits/10938].

The [2.1 patch|https://github.com/stef1927/cassandra/commits/10938-2.1] stands on its own.
The [2.2 patch|https://github.com/stef1927/cassandra/commits/10938-2.2] is identical to the
2.1 patch except for a conflict in the imports, and it merges cleanly upwards.

CI is still pending:

||2.1||2.2||3.0||3.3||trunk||
|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-2.1-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-2.2-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-3.0-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-3.3-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-testall/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-2.1-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-2.2-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-3.0-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-3.3-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-dtest/]|

[~pauloricardomg] could you review the Python changes? Sylvain has already noted above that
the change from NBHM to CHM is fine.

> test_bulk_round_trip_blogposts is failing occasionally
> ------------------------------------------------------
>
>                 Key: CASSANDRA-10938
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10938
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: 6452.nps, 6452.png, 7300.nps, 7300a.png, 7300b.png, node1_debug.log,
>                      node2_debug.log, node3_debug.log, recording_127.0.0.1.jfr
>
>
> We get timeouts occasionally that cause the number of records to be incorrect:
> http://cassci.datastax.com/job/trunk_dtest/858/testReport/cqlsh_tests.cqlsh_copy_tests/CqlshCopyTest/test_bulk_round_trip_blogposts/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
