cassandra-user mailing list archives

From eugene miretsky <eugene.miret...@gmail.com>
Subject Re: How do TTLs generate tombstones
Date Tue, 31 Oct 2017 12:39:51 GMT
Thanks,

We have turned off read repair, and read with consistency = ONE. That leaves
repairs and old timestamps (generated by the client) as possible causes for
the overlap. We are writing from Spark and didn't have NTP set up on the
cluster - I think that was causing some of the issues, but we have fixed it
and the problem remains.

It is hard for me to believe that C* repair has a bug, so before creating a
JIRA I would appreciate it if you could take a look at the attached sstable
metadata (produced using sstablemetadata) from two different points in time
over the last 2 weeks (we ran compaction in between).

In both cases, there are sstables generated around 8 pm that span very long
time periods (sometimes over a day). We run repair daily at 8 pm.

Cheers,
Eugene

On Wed, Oct 11, 2017 at 12:53 PM, Jeff Jirsa <jjirsa@gmail.com> wrote:

> Anti-entropy repairs ("nodetool repair") and bootstrap/decom/removenode
> should stream sections of (and/or possibly entire) sstables from one
> replica to another. Assuming the original sstable was entirely contained in
> a single time window, the resulting sstable fragment streamed to the
> neighbor node will similarly be entirely contained within a single time
> window, and will be joined with the sstables in that window. If you find
> this isn't the case, open a JIRA, that's a bug (it was explicitly a design
> goal of TWCS, as it was one of my biggest gripes with early versions of
> DTCS).
>
> Read repairs, however, will pollute the memtable and cause overlaps. There
> are two types of read repairs:
> - Blocking read repair due to consistency level (if you read at quorum and one
> of the replicas is missing data, the coordinator will issue mutations to
> the missing replica, which will go into the memtable and flush into the
> newest time window). This cannot be disabled (period), and is probably the
> reason most people have overlaps (because people tend to read their writes
> pretty quickly after writes in time series use cases, often before hints or
> normal repair can be successful, especially in environments where nodes are
> bounced often).
> - Background read repair (tunable with the read_repair_chance and
> dclocal_read_repair_chance table options), which is like blocking read
> repair, but happens probabilistically (i.e., there's a 1% chance on any read
> that the coordinator will scan the partition and copy any missing data to
> the replicas missing that data. Again, this goes to the memtable, and will
> flush into the newest time window).
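
For illustration, the probabilistic read repair described above can be turned
off per table with CQL. A minimal sketch, assuming a hypothetical table
ks.events on a pre-4.0 Cassandra where these table options exist (the blocking,
consistency-level-driven read repair cannot be disabled this way):

    ALTER TABLE ks.events
        WITH read_repair_chance = 0.0          -- probabilistic, any DC
        AND dclocal_read_repair_chance = 0.0;  -- probabilistic, local DC only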
>
> There's a pretty good argument to be made against manual repairs if (and
> only if) you only use TTLs, never explicitly delete data, and can tolerate
> the business risk of losing two machines at a time (that is: in the very
> very rare case that you somehow lose 2 machines before you can rebuild,
> you'll lose some subset of data that never made it to the sole remaining
> replica; is your business going to lose millions of dollars, or will you
> just have a gap in an analytics dashboard somewhere that nobody's going to
> worry about).
>
> - Jeff
>
>
> On Wed, Oct 11, 2017 at 9:24 AM, Sumanth Pasupuleti <
> spasupuleti@netflix.com.invalid> wrote:
>
>> Hi Eugene,
>>
>> Common contributors to overlapping SSTables are
>> 1. Hints
>> 2. Repairs
>> 3. New writes with old timestamps (should be rare but technically
>> possible)
>>
>> I would not run repairs with TWCS - as you indicated, it is going to
>> result in overlapping SSTables which impacts disk space and read latency
>> since reads now have to encompass multiple SSTables.
>>
>> As for https://issues.apache.org/jira/browse/CASSANDRA-13418, I would
>> not worry about data resurrection as long as all the writes carry TTL with
>> them.
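
As a reference point, one way to make sure every write carries a TTL is a
table-level default; a short sketch, assuming a hypothetical table ks.events
and a 7-day retention:

    -- 604800 seconds = 7 days; applied to any write that does not set its own TTL
    ALTER TABLE ks.events WITH default_time_to_live = 604800;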
>>
>> We faced similar overlapping issues with TWCS (it was due to
>> dclocal_read_repair_chance) - we developed an SSTable tool that would give
>> topN or bottomN keys in an SSTable based on writetime/deletion time - we
>> used this to identify the specific keys responsible for overlap between
>> SSTables.
>>
>> Thanks,
>> Sumanth
>>
>>
>> On Mon, Oct 9, 2017 at 6:36 PM, eugene miretsky <
>> eugene.miretsky@gmail.com> wrote:
>>
>>> Thanks Alain!
>>>
>>> We are using TWCS compaction, and I read your blog multiple times - it
>>> was very useful, thanks!
>>>
>>> We are seeing a lot of overlapping SSTables, leading to a number of
>>> problems: (a) a large number of tombstones read in queries, (b) high CPU
>>> usage, and (c) fairly long young-gen GC collections (300 ms).
>>>
>>> We have read_repair_chance = 0, unchecked_tombstone_compaction = true, and
>>> gc_grace_seconds = 3h, but we read and write with consistency = ONE.
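
For context, settings like the ones described above would look roughly like
this in CQL; a sketch with a hypothetical table name and an assumed 24-hour
compaction window (the actual window size is not stated in the thread):

    ALTER TABLE ks.events
        WITH compaction = {
            'class': 'TimeWindowCompactionStrategy',
            'compaction_window_unit': 'HOURS',
            'compaction_window_size': '24',            -- assumed window size
            'unchecked_tombstone_compaction': 'true' }
        AND read_repair_chance = 0.0
        AND dclocal_read_repair_chance = 0.0
        AND gc_grace_seconds = 10800;                  -- 3 hours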
>>>
>>> I'm suspecting the overlap is coming from either hinted handoff or a
>>> repair job we run nightly.
>>>
>>> 1) Is running repair with TWCS recommended? It seems like it will always
>>> create a never-ending overlap (the repair SSTable will have data from all
>>> 24 hours), an effect that seems to get amplified by anti-compaction.
>>> 2) TWCS seems to introduce a tradeoff between eventual consistency and
>>> write/read availability. If all repairs are turned off, the choice is
>>> either (a) use a strong consistency level and pay the price of lower
>>> availability and slower reads or writes, or (b) use a lower consistency
>>> level and risk inconsistent data (data is never repaired).
>>>
>>> I will try your last link, but reappearing data sounds a bit scary :)
>>>
>>> Any advice on how to debug this further would be greatly appreciated.
>>>
>>> Cheers,
>>> Eugene
>>>
>>> On Fri, Oct 6, 2017 at 11:02 AM, Alain RODRIGUEZ <arodrime@gmail.com>
>>> wrote:
>>>
>>>> Hi Eugene,
>>>>
>>>> If we never use updates (time series data), is it safe to set
>>>>> gc_grace_seconds=0.
>>>>
>>>>
>>>> As Kurt pointed out, you never want 'gc_grace_seconds' to be lower than
>>>> 'max_hint_window_in_ms', as the min of these 2 values is used as the hint
>>>> storage window in Apache Cassandra.
>>>>
>>>> Still, time series data with fixed TTLs allows very efficient use of
>>>> Cassandra, especially when using Time Window Compaction Strategy (TWCS).
>>>> Fun fact: Jeff brought it to Apache Cassandra :-). I would definitely
>>>> give it a try.
>>>>
>>>> Here is a post from my colleague Alex that I believe could be useful in
>>>> your case: http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
>>>>
>>>> Using TWCS and lowering 'gc_grace_seconds' to the value of
>>>> 'max_hint_window_in_ms' should be really effective. Make sure to use a
>>>> strong consistency level (generally RF = 3, CL.Read = CL.Write =
>>>> LOCAL_QUORUM) to prevent inconsistencies, I would say (depending on your
>>>> interest in consistency).
>>>>
>>>> This way you can expire entire SSTables without compaction. If overlaps
>>>> in SSTables become a problem, you could even consider giving a try to a
>>>> more aggressive SSTable expiration:
>>>> https://issues.apache.org/jira/browse/CASSANDRA-13418.
>>>>
>>>> C*heers,
>>>> -----------------------
>>>> Alain Rodriguez - @arodream - alain@thelastpickle.com
>>>> France / Spain
>>>>
>>>> The Last Pickle - Apache Cassandra Consulting
>>>> http://www.thelastpickle.com
>>>>
>>>>
>>>>
>>>> 2017-10-05 23:44 GMT+01:00 kurt greaves <kurt@instaclustr.com>:
>>>>
>>>>> No, it's never safe to set it to 0, as you'll disable hinted handoff for
>>>>> the table. If you are never doing updates or manual deletes, and you always
>>>>> insert with a TTL, you can get away with setting it to the hinted handoff
>>>>> period.
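
A minimal sketch of that suggestion, assuming a hypothetical table ks.events
and the default hint window of 3 hours (max_hint_window_in_ms: 10800000 in
cassandra.yaml):

    ALTER TABLE ks.events WITH gc_grace_seconds = 10800;  -- 3 h, matches the hint window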
>>>>>
>>>>> On 6 Oct. 2017 1:28 am, "eugene miretsky" <eugene.miretsky@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Jeff,
>>>>>>
>>>>>> Make sense.
>>>>>> If we never use updates (time series data), is it safe to set
>>>>>> gc_grace_seconds=0.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 4, 2017 at 5:59 PM, Jeff Jirsa <jjirsa@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> The TTL'd cell is treated as a tombstone. gc_grace_seconds applies to
>>>>>>> TTL'd cells, because even though the data is TTL'd, it may have been
>>>>>>> written on top of another live cell that wasn't ttl'd:
>>>>>>>
>>>>>>> Imagine a test table, simple key->value (k, v).
>>>>>>>
>>>>>>> INSERT INTO table(k,v) values(1,1);
>>>>>>> Kill 1 of the 3 nodes
>>>>>>> UPDATE table USING TTL 60 SET v=1 WHERE k=1 ;
>>>>>>> 60 seconds later, the live nodes will see that data as deleted, but
>>>>>>> when that dead node comes back to life, it needs to learn of the deletion.
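
A runnable version of that scenario, using a hypothetical keyspace and table
name (test.kv), since 'table' itself is a reserved word in CQL:

    CREATE KEYSPACE IF NOT EXISTS test
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

    CREATE TABLE IF NOT EXISTS test.kv (k int PRIMARY KEY, v int);

    INSERT INTO test.kv (k, v) VALUES (1, 1);
    -- one of the three replicas goes down here
    UPDATE test.kv USING TTL 60 SET v = 1 WHERE k = 1;
    -- 60 seconds later the surviving replicas treat v as expired (a tombstone);
    -- the returning replica still holds the original v = 1 and must learn of
    -- the expiry, which is why gc_grace_seconds applies to TTL'd cells.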
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Oct 4, 2017 at 2:05 PM, eugene miretsky <
>>>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> The following link says that TTLs generate tombstones -
>>>>>>>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpire.html.
>>>>>>>>
>>>>>>>> What exactly is the process that converts the TTL into a tombstone?
>>>>>>>>
>>>>>>>>    1. Is an actual new tombstone cell created when the TTL expires?
>>>>>>>>    2. Or, is the TTLed cell treated as a tombstone?
>>>>>>>>
>>>>>>>>
>>>>>>>> Also, does gc_grace_seconds have an effect on TTLed cells?
>>>>>>>> gc_grace_seconds is meant to protect from deleted data re-appearing if
>>>>>>>> the tombstone is compacted away before all nodes have reached a
>>>>>>>> consistent state. However, since the TTL is stored in the cell (in
>>>>>>>> liveness_info), there is no way for the cell to re-appear (the TTL will
>>>>>>>> still be there).
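
One way to see that the TTL travels with the cell itself: the TTL() and
WRITETIME() functions expose it at read time. A sketch against the
hypothetical test.kv table from the example above:

    -- TTL(v) is the remaining time in seconds, or null if no TTL was set
    SELECT k, v, TTL(v), WRITETIME(v) FROM test.kv WHERE k = 1;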
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Eugene
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>
