cassandra-commits mailing list archives

From "Joshua McKenzie (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12236) RTE from new CDC column breaks in flight queries.
Date Fri, 22 Jul 2016 12:58:20 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389445#comment-15389445 ]

Joshua McKenzie commented on CASSANDRA-12236:
---------------------------------------------

Current status: Not entirely sure what to make of that [upgrade test run|http://cassci.datastax.com/view/Dev/view/knifewine/job/knifewine-joshupgrade12236-upgrade/3/testReport/].

65 failures out of ~1350 test runs, so if the driver was bailing on the null in cdc I'd expect
we'd see far more than that. The errors I'm seeing are all over the map though, and I don't
know how many upgrade test errors we "expect" at this point (3.7 upgrade job not on cassci
that I'm seeing):
* [Timeouts|http://cassci.datastax.com/view/Dev/view/knifewine/job/knifewine-joshupgrade12236-upgrade/3/testReport/junit/upgrade_tests.paging_test/TestPagingDataNodes3RF3_Upgrade_current_3_0_x_To_indev_3_x/basic_paging_test/]
* [LegacyPagedRangeCommandSerializer.deserialize assertions|http://cassci.datastax.com/view/Dev/view/knifewine/job/knifewine-joshupgrade12236-upgrade/3/testReport/junit/upgrade_tests.paging_test/TestPagingDataNodes3RF3_Upgrade_current_3_0_x_To_indev_3_x/basic_paging_test_2/]
- see CASSANDRA-12249. I haven't dug deeply into that code but from initially looking into
it, I'm not sure how an added column in schema would lead to us sending a deprecated PAGED_RANGE
from a 3.8 to a 3.0.x node. That being said, I don't see any "guards" in general around a
PartitionRangeReadCommand.createMessage with a paging data range, and that predated the changes
in CASSANDRA-11393 so that would require more inspection to figure out what's going on.
* [Secondary index paging timeouts|http://cassci.datastax.com/view/Dev/view/knifewine/job/knifewine-joshupgrade12236-upgrade/3/testReport/junit/upgrade_tests.paging_test/TestPagingDataNodes3RF3_Upgrade_current_3_x_To_indev_3_x/test_paging_using_secondary_indexes/]
* [Failure to find unrelated columns|http://cassci.datastax.com/view/Dev/view/knifewine/job/knifewine-joshupgrade12236-upgrade/3/testReport/junit/upgrade_tests.cql_tests/TestCQLNodes2RF1_Upgrade_current_3_0_x_To_indev_3_x/select_with_alias_test/]

As for how we want to proceed from here: I'd say we a) re-run the upgrade jobs to see if the timeouts
were due to a flaky environment (had a lot of problems with that yesterday across a lot of jobs), b)
commit this change to 3.8/3.9/trunk, and c) start working the CASSANDRA-12249 angle since
that error showed up considerably more frequently than any other single error in the upgrade
test runs I saw.

As for what this means for the 3.8 release, my .02 is that I'd want to delta it against what
upgrade tests looked like for 3.6, 3.4, 3.2. This is an even release, meaning we don't recommend
rolling it out in production, and as long as our load of upgrade test failures for 3.8 isn't
a regression from the load we had for 3.6, I'd say we move forward, potentially even before
hammering out CASSANDRA-12249. Currently even releases are "feature" releases and odd are
"stable", so in my opinion there's no real need to hold up an even release for problems
that only show up in mixed-version clusters during upgrades.

> RTE from new CDC column breaks in flight queries.
> -------------------------------------------------
>
>                 Key: CASSANDRA-12236
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12236
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jeremiah Jordan
>            Assignee: Joshua McKenzie
>             Fix For: 3.x
>
>         Attachments: 12236.txt
>
>
> This RTE is not harmless. It will cause the internode connection to break which will
> cause all in flight requests between these nodes to die/timeout.
> {noformat}
>     - Due to changes in schema migration handling and the storage format after 3.0, you
>       will see error messages such as:
>          "java.lang.RuntimeException: Unknown column cdc during deserialization"
>       in your system logs on a mixed-version cluster during upgrades. This error message
>       is harmless and due to the 3.8 nodes having cdc added to their schema tables while
>       the <3.8 nodes do not. This message should cease once all nodes are upgraded
>       to 3.8.
>       As always, refrain from schema changes during cluster upgrades.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
