cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marcus Eriksson (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8683) Incremental repairs broken with early opening of compaction results
Date Mon, 26 Jan 2015 14:40:34 GMT


Marcus Eriksson commented on CASSANDRA-8683:

from irc:
15:27 < marcuse> belliottsmith: we create a new sstable reader over the same file, the
reference counter is not transfered to the new file, so if i bump the reference count on the
file i hold a reference to, it will not affect the latest one
15:27 < marcuse> does it make sense/
15:27 < marcuse> * transfered to the new SSTR instance
15:28 < belliottsmith> but you should have taken a reference against the one you're
using, no?
15:28 < marcuse> belliottsmith: no, we don't want to hold references during the whole
duration of the repair (hours)
15:28 < belliottsmith> each instance is distinct, but so long as you take a reference
against the one you're using, it shouldn't be a problem...
15:28 < marcuse> so, if a file is gone, we simply remove it from the set and dont anticompact
15:29 < belliottsmith> ah
15:29 < belliottsmith> hmm. that is tricky, since we need to repeatedly refetch the
set of files we want to use, since we could have new data replacing it as well, surely?
15:30 < belliottsmith> i don't really see a way around holding a reference for the lifetime
of the repair. but if we stream oldest files first we can drop references to them as we go,
15:30 < marcuse> no, for incremental repairs we repair best-effort, if a file is gone,
it means it has been compacted away and will get repaired next time around
15:31 < belliottsmith> so where/when do we grab our reference to the sstable then?
15:31 < belliottsmith> because in that case it sounds like we should simply be failing
to grab our reference count
15:31 < marcuse>
15:32 < marcuse> and yes, we probably should fail, but we don't
15:32 < belliottsmith> that code looks a bit odd. why do we check for data component
first? shouldn't an attempt to acquire the ref be sufficient? (clearly it isn't here though,
since it's not removing it from the set)
15:33 < marcuse> yes, it should, but i don't think we decrease ref count when we replace
an instance?
15:34 < marcuse> ie, when we open early and have a new instance, we should decrease
the reference on the old file
15:34 < marcuse> *old instance\
15:34 < belliottsmith> we have to
15:34 < belliottsmith> otherwise it would never get deleted
15:34 < marcuse> but its the same underlying file right? it will get deleted by the
new instance
15:34 < belliottsmith> if we don't that's definitely a bug, but i'm pretty sure we do
15:34 < belliottsmith> but lifetimes don't expire in order necessarily
15:35 < belliottsmith> so the older files ref count to zero, then remove themselves
from the linked-list of replacements
15:35 < belliottsmith> if there's nobody left, they delete the file
15:36 < belliottsmith> this all is in dire need of cleaning up in 8568

> Incremental repairs broken with early opening of compaction results
> -------------------------------------------------------------------
>                 Key: CASSANDRA-8683
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Marcus Eriksson
>            Assignee: Marcus Eriksson
>             Fix For: 2.1.3
> Incremental repairs holds a set of the sstables it started the repair on (we need to
know which sstables were actually validated to be able to anticompact them). This includes
any tmplink files that existed when the compaction started (if we wouldn't include those,
we would miss data since we move the start point of the existing non-tmplink files)
> With CASSANDRA-6916 we swap out those instances with new ones (SSTR.cloneWithNewStart
/ SSTW.openEarly), meaning that the underlying file can get deleted even though we hold a
> This causes the unit test error:
> (note that it only fails on trunk though, in 2.1 we don't hold references to the repairing
files for non-incremental repairs, but the bug should exist in 2.1 as well)

This message was sent by Atlassian JIRA

View raw message