cassandra-user mailing list archives

From DuyHai Doan <doanduy...@gmail.com>
Subject Re: Re : Purging tombstones from a particular row in SSTable
Date Sat, 30 Jul 2016 11:52:12 GMT
Looks like skipping SSTables based on max SSTable timestamp is possible if
you have the partition deletion info:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SinglePartitionReadCommand.java#L538-L550

But it doesn't say anything about iterating over all the cells in a single
partition when there is a partition tombstone; I need to dig further.




On Sat, Jul 30, 2016 at 2:03 AM, Eric Stevens <mightye@gmail.com> wrote:

> I haven't tested that specifically, but I haven't bumped into any
> particular optimization that allows it to skip reading an sstable where the
> entire relevant partition has been row-tombstoned.  It's possible that
> something like that could happen by examining min/max timestamps on
> sstables, and not reading from any sstable whose max timestamp is less
> than the timestamp of the partition-level tombstone.  However, that
> presumes it has read the tombstones from each sstable before it reads the
> occluded data, which I don't think is likely.
>
> Such an optimization could be there, but I haven't noticed it if it is,
> though I'm certainly not an expert (more of a well informed novice).  If
> someone wants to set me straight on this point I'd love to know about it.
>
> On Fri, Jul 29, 2016 at 2:37 PM DuyHai Doan <doanduyhai@gmail.com> wrote:
>
>> @Eric
>>
>> Very interesting example. But then what about the case of row (should I
>> say partition ?) tombstones ?
>>
>> Suppose that in your example, I issued a DELETE FROM foo WHERE pk='a'
>>
>> With the same SELECT statement as before, would C* be clever enough to
>> skip reading the whole partition entirely (let's limit the example to a
>> single SSTable) ?
>>
>> On Fri, Jul 29, 2016 at 7:00 PM, Eric Stevens <mightye@gmail.com> wrote:
>>
>>> > Sai was describing a timeout, not a failure due to the 100 K
>>> > tombstone limit from cassandra.yaml. But I still might be missing things
>>> > about tombstones.
>>>
>>> The trouble with tombstones is not the tombstones themselves; rather,
>>> it's that Cassandra has a lot of deleted data to read through in sstables
>>> in order to satisfy a query.  Although, if you range-constrain your
>>> clustering key in your query, the read path can optimize the read to start
>>> somewhere near the correct head of your selected data, the same is _not_
>>> true for tombstoned data.
>>>
>>> Consider this exercise:
>>> CREATE TABLE foo (
>>>   pk text,
>>>   ck int,
>>>   PRIMARY KEY ((pk), ck)
>>> );
>>> INSERT INTO foo (pk, ck) VALUES ('a', 1);
>>> ...
>>> INSERT INTO foo (pk, ck) VALUES ('a', 100000);
>>>
>>> $ nodetool flush
>>>
>>> DELETE FROM foo WHERE pk='a' AND ck < 99999;
>>>
>>> We've now written a single "tiny" (bytes-wise) range tombstone.
>>>
>>> Now try to select from that table:
>>> SELECT * FROM foo WHERE pk='a' AND ck > 50000 LIMIT 1;
>>> pk | ck
>>> ---+------
>>>  a | 99999
>>>
>>> This has to read from the first sstable, skipping over 49,998 tombstoned
>>> cells before it can locate the first non-tombstoned cell.
>>>
>>> The problem isn't the size of the tombstone; tombstones themselves are
>>> cheaper (again, bytes-wise) than standard columns because they don't
>>> carry any value for the cell.  The problem is that the read path cannot
>>> anticipate in advance which cells are going to be occluded by the tombstone,
>>> and in order to satisfy the query it needs to read and then discard a large
>>> number of deleted cells.
>>>
>>> The reason the thresholds exist in cassandra.yaml is to help guide users
>>> away from performance anti-patterns that come from selects which include a
>>> large number of tombstoned cells.
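>>>
>>> For reference, assuming a package install with the config under
>>> /etc/cassandra (adjust the path for your layout), you can check those
>>> thresholds with something like:
>>>
>>> # show the tombstone guardrails (defaults: warn at 1000, fail at 100000)
>>> grep -E 'tombstone_(warn|failure)_threshold' /etc/cassandra/cassandra.yaml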
>>>
>>> On Thu, Jul 28, 2016 at 11:08 PM Alain RODRIGUEZ <arodrime@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> @Eric
>>>>
>>>>> Large range tombstones can occupy just a few bytes but can occlude
>>>>> millions of records, and have the corresponding performance impact on
>>>>> reads.  It's really not the size of the tombstone on disk that matters, but
>>>>> the number of records it occludes.
>>>>
>>>>
>>>> Sai was describing a timeout, not a failure due to the 100 K tombstone
>>>> limit from cassandra.yaml. But I still might be missing things about
>>>> tombstones.
>>>>
>>>>> The read queries are continuously failing though because of the
>>>>> tombstones. "Request did not complete within rpc_timeout."
>>>>>
>>>>
>>>> So that is what looks weird to me. Reading 220 KB, even holding
>>>> tombstones, should probably not take that long... Or am I wrong or missing
>>>> something?
>>>>
>>>> Your talk looks like cool stuff :-).
>>>>
>>>> @Sai
>>>>
>>>>> The issue here was that the tombstones were not in the SSTable, but rather
>>>>> in the Memtable
>>>>
>>>>
>>>> This sounds weird to me as well, knowing that memory is faster than
>>>> disk and that memtables are mutable data (so less stuff to read from
>>>> there). Flushing might have triggered a compaction that removed the
>>>> tombstones, though.
>>>>
>>>> This still sounds very weird to me but I am glad you solved your issue
>>>> (temporary at least).
>>>>
>>>> C*heers,
>>>> -----------------------
>>>> Alain Rodriguez - alain@thelastpickle.com
>>>> France
>>>>
>>>> The Last Pickle - Apache Cassandra Consulting
>>>> http://www.thelastpickle.com
>>>>
>>>> 2016-07-29 3:25 GMT+02:00 Eric Stevens <mightye@gmail.com>:
>>>>
>>>>> Tombstones will not get removed even after gc_grace if bloom filters
>>>>> indicate that there is data overlapping the tombstone's partition in a
>>>>> different sstable.  This is because compaction can't be certain that the
>>>>> tombstone doesn't overlap data in that other sstable.  If you're writing to
>>>>> one end of a partition while deleting off the other end (for example
>>>>> you've engaged in the queue anti-pattern), your tombstones will
>>>>> essentially never go away.
>>>>>
>>>>>
>>>>>> 220kb worth of tombstones doesn’t seem like enough to worry about.
>>>>>
>>>>>
>>>>> Large range tombstones can occupy just a few bytes but can occlude
>>>>> millions of records, and have the corresponding performance impact on
>>>>> reads.  It's really not the size of the tombstone on disk that matters, but
>>>>> the number of records it occludes.
>>>>>
>>>>> You must either do a full compaction (while also not writing to the
>>>>> partitions being considered, and after you've forced a cluster-wide flush,
>>>>> and after the tombstones are gc_grace old, and assuming size tiered and not
>>>>> leveled compaction) to get rid of those tombstones, or, probably easier,
>>>>> do something similar to sstable2json, remove the tombstones by hand,
>>>>> then json2sstable and replace the offending sstable.  Note that you really
>>>>> have to be certain what you're doing here or you'll end up resurrecting
>>>>> deleted records.
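>>>>>
>>>>> Very roughly, the sstable2json route looks something like this (2.0.x
>>>>> tools; the keyspace, table and file names below are made up, and you want
>>>>> the sstable quiesced, e.g. the node stopped, before you swap it out):
>>>>>
>>>>> # dump the sstable to JSON
>>>>> sstable2json /var/lib/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-42-Data.db > dump.json
>>>>> # hand-edit dump.json to remove the tombstoned cells, then rebuild
>>>>> json2sstable -K my_ks -c my_cf dump.json my_ks-my_cf-jb-42-Data.db
>>>>> # put the rebuilt file back in place of the original and restart the node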
>>>>>
>>>>> If these all sound like bad options, it's because they are, and you
>>>>> don't have a lot of options without changing your schema to eventually stop
>>>>> writing to (and especially reading from) partitions which you also do
>>>>> deletes on.  https://issues.apache.org/jira/browse/CASSANDRA-7019 proposes
>>>>> to offer a better alternative, but it's still in progress.
>>>>>
>>>>> Shameless plug, I'm talking about my company's alternative to
>>>>> tombstones and TTLs at this year's Cassandra Summit:
>>>>> http://myeventagenda.com/sessions/1CBFC920-807D-41C1-942C-8D1A7C10F4FA/5/5#sessionID=165
>>>>>
>>>>>
>>>>> On Thu, Jul 28, 2016 at 11:07 AM sai krishnam raju potturi <
>>>>> pskraju88@gmail.com> wrote:
>>>>>
>>>>>> thanks a lot Alain. That was really great info.
>>>>>>
>>>>>> The issue here was that the tombstones were not in the SSTable, but
>>>>>> rather in the Memtable. We had to do a nodetool flush, and run a nodetool
>>>>>> compact to get rid of the tombstones, a million of them. The size of the
>>>>>> largest SSTable was actually 48MB.
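>>>>>>
>>>>>> For anyone hitting the same thing, that was basically the following
>>>>>> (keyspace and table names are just placeholders):
>>>>>>
>>>>>> nodetool flush my_ks my_cf    # push the memtable, tombstones included, to disk
>>>>>> nodetool compact my_ks my_cf  # major compaction; drops tombstones older than gc_grace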
>>>>>>
>>>>>> This link was helpful in getting the count of tombstones in a
>>>>>> sstable, which was 0 in our case.
>>>>>> https://gist.github.com/JensRantil/063b7c56ca4a8dfe1c50
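>>>>>>
>>>>>> A very rough equivalent for a single 2.0.x sstable (the file name is just
>>>>>> an example; this only counts cells flagged "d" = deleted in the JSON dump):
>>>>>>
>>>>>> sstable2json my_ks-my_cf-jb-42-Data.db | grep -o '"d"' | wc -l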
>>>>>>
>>>>>>     The application team did not have a good model. They are working
>>>>>> on a new datamodel.
>>>>>>
>>>>>> thanks
>>>>>>
>>>>>> On Wed, Jul 27, 2016 at 7:17 PM, Alain RODRIGUEZ <arodrime@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I just released a detailed post about tombstones today that might be
>>>>>>> of some interest for you:
>>>>>>> http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
>>>>>>>
>>>>>>>> 220kb worth of tombstones doesn’t seem like enough to worry about.
>>>>>>>
>>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> I believe you might be missing some other bigger SSTable having a
>>>>>>> lot of tombstones as well. Finding the biggest sstable and reading the
>>>>>>> tombstone ratio from there might be more relevant.
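>>>>>>>
>>>>>>> For example, something along these lines (sstablemetadata ships in
>>>>>>> tools/bin of the Cassandra distribution; the data path is just an
>>>>>>> illustration):
>>>>>>>
>>>>>>> # biggest sstables first
>>>>>>> ls -lS /var/lib/cassandra/data/my_ks/my_cf/*-Data.db | head -5
>>>>>>> # estimated droppable tombstone ratio for one of them
>>>>>>> sstablemetadata /var/lib/cassandra/data/my_ks/my_cf/my_ks-my_cf-jb-42-Data.db | grep -i droppable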
>>>>>>>
>>>>>>> You also should give a try to "unchecked_tombstone_compaction" set
>>>>>>> to true rather than tuning other options so aggressively. The "single
>>>>>>> SSTable compaction" section of my post might help you on this issue:
>>>>>>> http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html#single-sstable-compaction
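>>>>>>>
>>>>>>> Something like this should do it (keyspace/table names are placeholders;
>>>>>>> note the whole compaction map is replaced, so keep your other compaction
>>>>>>> options in it):
>>>>>>>
>>>>>>> echo "ALTER TABLE my_ks.my_cf WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'unchecked_tombstone_compaction': 'true'};" | cqlsh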
>>>>>>>
>>>>>>> Other thoughts:
>>>>>>>
>>>>>>> Also, if you use TTLs and time series, using TWCS instead of STCS
>>>>>>> could be more efficient at evicting tombstones.
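>>>>>>>
>>>>>>> On a version that actually ships TWCS, that would look roughly like this
>>>>>>> (table name and window settings are only an example):
>>>>>>>
>>>>>>> echo "ALTER TABLE my_ks.my_cf WITH compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': 1};" | cqlsh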
>>>>>>>
>>>>>>>> we have a columnfamily that has around 1000 rows, of which one row
>>>>>>>> is really huge (a million columns)
>>>>>>>
>>>>>>>
>>>>>>> I am sorry to say that this model does not look that great.
>>>>>>> Imbalances might become an issue as a few nodes will handle a lot more
>>>>>>> load than the rest of the nodes. Also, even if this is getting improved in
>>>>>>> newer versions of Cassandra, wide rows are something you want to avoid
>>>>>>> while using 2.0.14 (which has not been supported for about a year now). I
>>>>>>> know it is not always easy and never the good time, but maybe you should
>>>>>>> consider upgrading both your model and your version of Cassandra
>>>>>>> (regardless of whether you manage to solve this issue with
>>>>>>> "unchecked_tombstone_compaction").
>>>>>>>
>>>>>>> Good luck,
>>>>>>>
>>>>>>> C*heers,
>>>>>>> -----------------------
>>>>>>> Alain Rodriguez - alain@thelastpickle.com
>>>>>>> France
>>>>>>>
>>>>>>> The Last Pickle - Apache Cassandra Consulting
>>>>>>> http://www.thelastpickle.com
>>>>>>>
>>>>>>> 2016-07-28 0:00 GMT+02:00 sai krishnam raju potturi <
>>>>>>> pskraju88@gmail.com>:
>>>>>>>
>>>>>>>> The read queries are continuously failing though because of the
>>>>>>>> tombstones. "Request did not complete within rpc_timeout."
>>>>>>>>
>>>>>>>> thanks
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jul 27, 2016 at 5:51 PM, Jeff Jirsa <
>>>>>>>> jeff.jirsa@crowdstrike.com> wrote:
>>>>>>>>
>>>>>>>>> 220kb worth of tombstones doesn’t seem like enough to worry about.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From: *sai krishnam raju potturi <pskraju88@gmail.com>
>>>>>>>>> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>>>>>> *Date: *Wednesday, July 27, 2016 at 2:43 PM
>>>>>>>>> *To: *Cassandra Users <user@cassandra.apache.org>
>>>>>>>>> *Subject: *Re: Re : Purging tombstones from a particular row in SSTable
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> and also the sstable in question is only about 220 kb in size.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 27, 2016 at 5:41 PM, sai krishnam raju potturi <
>>>>>>>>> pskraju88@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> it's set to 1800 Vinay.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>   bloom_filter_fp_chance=0.010000 AND
>>>>>>>>>   caching='KEYS_ONLY' AND
>>>>>>>>>   comment='' AND
>>>>>>>>>   dclocal_read_repair_chance=0.100000 AND
>>>>>>>>>   gc_grace_seconds=1800 AND
>>>>>>>>>   index_interval=128 AND
>>>>>>>>>   read_repair_chance=0.000000 AND
>>>>>>>>>   replicate_on_write='true' AND
>>>>>>>>>   populate_io_cache_on_flush='false' AND
>>>>>>>>>   default_time_to_live=0 AND
>>>>>>>>>   speculative_retry='99.0PERCENTILE' AND
>>>>>>>>>   memtable_flush_period_in_ms=0 AND
>>>>>>>>>   compaction={'min_sstable_size': '1024', 'tombstone_threshold': '0.01',
>>>>>>>>>     'tombstone_compaction_interval': '1800', 'class': 'SizeTieredCompactionStrategy'} AND
>>>>>>>>>   compression={'sstable_compression': 'LZ4Compressor'};
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 27, 2016 at 5:34 PM, Vinay Kumar Chella <
>>>>>>>>> vinaykumarcse@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> What is your GC_grace_seconds set to?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 27, 2016 at 1:13 PM, sai krishnam raju potturi <
>>>>>>>>> pskraju88@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> thanks Vinay and DuyHai.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>     we are using version 2.0.14. I did "user defined compaction"
>>>>>>>>> following the instructions in the below link; the tombstones still
>>>>>>>>> persist even after that.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://gist.github.com/jeromatron/e238e5795b3e79866b83
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Also, we changed the tombstone_compaction_interval : 1800
>>>>>>>>> and tombstone_threshold : 0.1, but it did not help.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 27, 2016 at 4:05 PM, DuyHai Doan <doanduyhai@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> This feature is also exposed directly in nodetool from Cassandra 3.4
>>>>>>>>> onwards:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> nodetool compact --user-defined <SSTable file>
>>>>>>>>>
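>>>>>>>>> For example (the path is only an illustration, point it at the actual
>>>>>>>>> -Data.db file):
>>>>>>>>>
>>>>>>>>> nodetool compact --user-defined /var/lib/cassandra/data/my_ks/my_cf/ma-42-big-Data.db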
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 27, 2016 at 9:58 PM, Vinay Chella <vchella@netflix.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> You can run file level compaction using JMX to get rid of
>>>>>>>>> tombstones in one SSTable. Ensure you set GC_Grace_seconds such that
>>>>>>>>>
>>>>>>>>> current time >= deletion (tombstone) time + GC_Grace_seconds
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> File level compaction
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> /usr/bin/java -jar cmdline-jmxclient-0.10.3.jar - localhost:{port} \
>>>>>>>>>   org.apache.cassandra.db:type=CompactionManager \
>>>>>>>>>   forceUserDefinedCompaction="'${KEYSPACE}','${SSTABLEFILENAME}'"
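>>>>>>>>>
>>>>>>>>> For example, with made-up values (7199 is the default JMX port; use your
>>>>>>>>> actual keyspace and -Data.db file name):
>>>>>>>>>
>>>>>>>>> /usr/bin/java -jar cmdline-jmxclient-0.10.3.jar - localhost:7199 \
>>>>>>>>>   org.apache.cassandra.db:type=CompactionManager \
>>>>>>>>>   forceUserDefinedCompaction="'my_ks','my_ks-my_cf-jb-42-Data.db'"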
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 27, 2016 at 11:59 AM, sai krishnam raju potturi <
>>>>>>>>> pskraju88@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> hi;
>>>>>>>>>
>>>>>>>>>   we have a columnfamily that has around 1000 rows, of which one row
>>>>>>>>> is really huge (a million columns). 95% of that row consists of
>>>>>>>>> tombstones. Since there exists just one SSTable, no compaction is going
>>>>>>>>> to kick in. Any way we can get rid of the tombstones in that row?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Neither user-defined compaction nor nodetool compact had any effect.
>>>>>>>>> Any ideas, folks?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> thanks
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>
