cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)
Date Mon, 20 Aug 2018 12:48:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585874#comment-16585874
] 

Benedict commented on CASSANDRA-14409:
--------------------------------------

I have pushed a patch with some major refactors to {{ReplicaCollection}} [here|https://github.com/belliottsmith/cassandra/tree/14409-replicas-wip],
in that it is seemingly at parity for unit/dtests.  There are some things I would like to
improve, but we’re too pressed for time, and at the point of diminishing returns.  

The patch does include some minor behavioural changes to fix some unit tests that were broken
prior to it, and some (in progress) review feedback comments I’m collecting.

In summary, it introduces {{EndpointsForRange}}, {{EndpointsForToken}} and {{RangesAtEndpoint}}.
 There are also {{Endpoints}}, a super class of the first two, and {{AbstractReplicaCollection}}
that implements the majority of functionality.

Main improvements:
* These classes are immutable by default; I think this actually reduces copying on average,
and improves clarity
* There are {{Mutable}} variants, for clearly delineating where this is necessary
* We now declare in all locations what our expected semantics are, and we enforce them
* There is only one implementation for all non-boilerplate {{ReplicaCollection}} methods,
so a reader can quickly determine the characteristics at a call site
* (Almost) all O(n^2) call sites have been eliminated, mostly by our stronger constraints
and maintaining a {{LinkedHashMap}}.  This is eagerly constructed for a brand-new collection,
and lazily for a filtered/sorted collection.

Some open questions: 
* I have limited mutability to append only, which permits easy immutable snapshots, but may
be fiddly to change later.
* I have made the {{Mutable}} variants extend the immutable ones, but this is not necessary,
as an immutable view can always be constructed - this might be cleaner, and might also permit
easier sharing of {{Mutable}} implementation details
* {{AbstractReplicaCollection}} introduces a method called {{select()}} for efficiently getting
a subset of the collection based on a sequence of filters - this considerably clarifies one
caller, and might clarify others I didn’t spot, but is optional.
* Conversely, since we often filter and sort together, this class could be extended to perform
them together, and avoid some unnecessary work, but since this would be harder to rollback
it probably needs consensus first.
* We maintain a {{LinkedHashMap}} to avoid code bloat while enforcing the same iteration order
in our derived collections, but we could introduce an {{AbstractMap}} that proxies to a pure
{{HashMap}}, and iterates our list.  This further complicates the code (slightly), but improves
performance for sorting and building.

Suboptimality:
* {{Endpoints<?>}} is necessary in a number of places, because we conflate range and
point ops in just a handful of places; namely writing batch/hints to ‘any’ other host,
and in Data/DigestResolver, which can be for range or point queries.  This means some uglier
generics in places, and some code is less clear than it might be.  The places where this conflation
occurs are well delineated now, at least.
* We don’t have isolation when getting natural and pending replicas, so we ignore conflicting
endpoints when concatenating these.
* {{Endpoints.newMutable}} - this is an instance method for a new {{Mutable}} that matches
the type of the instance.  It’s ugly, but presently needed for the above conflation.  Possibly
the select() class could be modified to support these cases instead.
* {{EndpointsForToken}} and {{EndpointsForRange}} *require* a {{Token}} and {{Range}} respectively
- this means a bit of boilerplate occasionally to pass the correct value to empty(), as well
as some unnecessary allocations.  We could probably relax this, but it makes the semantics
of {{newMutable}} tricky (if we improved select() to replace this, we could probably avoid
this)
* {{EndpointsByRange}} and {{RangesByEndpoint}} have immutable and mutable variants, though
I don’t love these at all.  Nor do I like {{EndpointsByReplica}}.  But they will probably
suffice.

> Transient Replication: Support ring changes when transient replication is in use (add/remove
node, change RF, add/remove DC)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14409
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14409
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Coordination, Core, Documentation and Website
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>            Priority: Major
>             Fix For: 4.0
>
>
> The additional state transitions that transient replication introduces require streaming
and nodetool cleanup to behave differently. We already have code that does the streaming,
but in some cases we shouldn't stream any data and in others when we stream to receive data
we have to make sure we stream from a full replica and not a transient replica.
> Transitioning from not replicated to transiently replicated means that a node must stay
pending until the next incremental repair completes at which point the data for that range
is known to be available at full replicas.
> Transitioning from transiently replicated to fully replicated requires streaming from
a full replica and is identical to how we stream from not replicated to replicated. The transition
must be managed so the transient replica is not read from as a full replica until streaming
completes. It can be used immediately for a write quorum.
> Transitioning from fully replicated to transiently replicated requires cleanup to remove
repaired data from the transiently replicated range to reclaim space. It can be used immediately
for a write quorum.
> Transitioning from transiently replicated to not replicated requires cleanup to be run
to remove the formerly transiently replicated data.
> nodetool move, removenode, cleanup, decommission, and rebuild need to handle these issues
as does bootstrap.
> Update web site, documentation, NEWS.txt with a description of the steps for doing common
operations. Add/remove DC, Add/remove node(s), replace node, change RF.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org


Mime
View raw message