cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly
Date Fri, 24 Jun 2011 07:56:47 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054297#comment-13054297 ]

Sylvain Lebresne commented on CASSANDRA-2816:
---------------------------------------------

I'm not sure what you mean by "snapshotting immediately" or "polishing our snapshot support",
but one approach that I think is equivalent (or maybe that is what you meant by 'snapshotting')
would be to grab references to the sstables at the very beginning of each request and use
those throughout the repair. This has a problem, however: it means we prevent sstables from
being deleted during the repair, including sstables that get compacted in the meantime. Because
repair can take a while, this would be bad. It would also require changes to the wire protocol
(because we'd need a way to indicate during streaming which set of sstables to consider), and
since we've more or less decided not to do that in minor releases (at least until we've discussed
it), it couldn't be released quickly. Which is bad, because I'm pretty sure this is a good part
of the reason why some people with big data sets have had huge pain with repair.
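To make the idea concrete, here is a rough sketch of what "grab references at the start and reuse them" could look like. The refcounting calls (markReferenced/unreference) and the per-request keying are assumptions for illustration only, not the actual Cassandra API, but they show why compacted-away sstables would be retained for the whole repair:

{code}
import java.util.Collection;
import java.util.concurrent.ConcurrentHashMap;

class RepairSessionSnapshotSketch
{
    // sstable set captured when the repair request starts, keyed by CF/range request id (assumed)
    private final ConcurrentHashMap<String, Collection<SSTable>> pinned =
            new ConcurrentHashMap<String, Collection<SSTable>>();

    void pin(String requestId, Collection<SSTable> current)
    {
        for (SSTable sstable : current)
            sstable.markReferenced();        // assumed refcount API: keeps the file from being deleted
        pinned.put(requestId, current);
    }

    Collection<SSTable> sstablesFor(String requestId)
    {
        return pinned.get(requestId);        // the exact same set is used for validation and streaming
    }

    void release(String requestId)
    {
        for (SSTable sstable : pinned.remove(requestId))
            sstable.unreference();           // only now can compacted-away sstables be deleted
    }

    // placeholder so the sketch is self-contained
    interface SSTable { void markReferenced(); void unreference(); }
}
{code}

The downside described above falls out directly: anything pinned at the start stays on disk until release() runs, even if compaction has already rewritten it.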

Scheduling the validations one by one avoids those problems. In theory this means we do
less work in parallel, but in practice I doubt this is a big deal, since the goal is probably
to have repair put less load on the node rather than more. It will also make this easier
to reason about.
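A minimal sketch of that one-by-one scheduling, assuming a hypothetical requestValidation call that returns a future completing once every node (local and neighbors) has sent back its merkle tree; the class and method names are illustrative, not the existing AntiEntropyService code:

{code}
import java.util.List;
import java.util.concurrent.Future;

class SequentialValidationSketch
{
    // One CF/range at a time: the next validation is only requested once every node
    // has returned its merkle tree for the previous one, so for a given request the
    // trees are built over (nearly) the same data on every node.
    void repair(List<CfRange> requests) throws Exception
    {
        for (CfRange request : requests)
        {
            Future<MerkleTrees> trees = requestValidation(request); // assumed: sends the tree request to self + neighbors
            differenceAndStream(trees.get());                       // wait for all trees before submitting the next request
        }
    }

    // Placeholders so the sketch is self-contained; the real differencing/streaming lives elsewhere.
    static class CfRange {}
    static class MerkleTrees {}
    Future<MerkleTrees> requestValidation(CfRange request) { throw new UnsupportedOperationException(); }
    void differenceAndStream(MerkleTrees trees) {}
}
{code}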

> Repair doesn't synchronize merkle tree creation properly
> --------------------------------------------------------
>
>                 Key: CASSANDRA-2816
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2816
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.0, 0.8.0
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>              Labels: repair
>
> Being a little slow, I just realized after having opened CASSANDRA-2811 and CASSANDRA-2815 that there is a more general problem with repair.
> When a repair is started, it sends a number of merkle tree requests to its neighbors as well as to itself, and it assumes for correctness that the building of those trees starts on every node at roughly the same time (if not, we end up comparing snapshots of the data taken at different times and will thus mistakenly repair a lot of data that doesn't need it). This assumption is bogus for several reasons:
> * Because validation compaction runs on the same executor as other compactions, the start of the validation on the different nodes is subject to other compactions. 0.8 mitigates this somewhat by being multi-threaded (so there is less chance of being blocked for a long time by a long-running compaction), but since the compaction executor is bounded, it's still a problem.
> * If you run nodetool repair without arguments, it will repair every CF. As a consequence it will generate lots of merkle tree requests, and all of those requests will be issued at the same time. Because even in 0.8 the compaction executor is bounded, some of those validations will end up being queued behind the first ones. Even assuming the different validations are submitted in the same order on each node (which isn't guaranteed either), there is no guarantee that the first validation will take the same time on all nodes, hence desynchronizing the queued ones.
> Overall, it is important for the precision of repair that for a given CF and range (which is the unit at which trees are computed), we make sure that all nodes start the validation at the same time (or, since we can't do magic, as close to the same time as possible).
> One (reasonably simple) proposal to fix this would be to have repair schedule validation compactions across nodes one by one (i.e., one CF/range at a time), waiting for all nodes to return their tree before submitting the next request. Then, on each node, we should make sure the validation compaction starts as soon as it is requested. For that, we probably want a specific executor for validation compactions and then:
> * either we fail the whole repair whenever one node is not able to execute the validation compaction right away (because no threads are available), or
> * we simply tell users that if they start too many repairs in parallel, they may see some of them repairing more data than they should.
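A rough sketch of the dedicated validation executor with the fail-fast option from the quoted proposal above; the pool size, class name and exception handling are assumptions for illustration, not the actual implementation:

{code}
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ValidationExecutorSketch
{
    // Separate pool so validation compactions never queue behind regular compactions.
    // With a SynchronousQueue and a bounded pool, a submitted task either starts
    // immediately or is rejected, which maps to "fail the repair right away".
    private final ThreadPoolExecutor validationExecutor = new ThreadPoolExecutor(
            0, 2,                                   // assumed bound on concurrent validations
            60, TimeUnit.SECONDS,
            new SynchronousQueue<Runnable>());

    void submitValidation(Runnable validationCompaction)
    {
        try
        {
            validationExecutor.execute(validationCompaction);
        }
        catch (RejectedExecutionException e)
        {
            // No thread available right away: abort this repair session rather than
            // letting trees on different nodes be built at very different times.
            throw new RuntimeException("Too many validations in flight; failing repair", e);
        }
    }
}
{code}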

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
