incubator-cassandra-user mailing list archives

From DuyHai Doan <>
Subject Re: clarification on 100k tombstone limit in indexes
Date Sun, 10 Aug 2014 21:20:27 GMT
Hello Ian

"It sounds like this 100k limit is, indeed, a "global" limit as opposed to
a per-row limit" --> The threshold applies to each "REQUEST", not per
partition or globally.
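
To make the per-request behaviour concrete, here is a toy model in Python. It
is only a sketch of the idea, not Cassandra's actual read path: the Cell class,
the TombstoneLimitExceeded error and the helper functions are names made up for
this illustration, and the two thresholds are the values quoted later in this
thread (warning at 1000, failure at 100k).

```python
from dataclasses import dataclass

# Values taken from the thread: a warning at 1000 tombstones (1.2) and a
# failure at 100k (2.0, configurable via tombstone_failure_threshold).
WARN_THRESHOLD = 1_000
FAILURE_THRESHOLD = 100_000


class TombstoneLimitExceeded(Exception):
    """Hypothetical error type for this sketch, not a Cassandra class."""


@dataclass
class Cell:
    name: str
    deleted: bool  # True means the cell is a tombstone


def read_request(partitions, limit, failure_threshold=FAILURE_THRESHOLD):
    """Return (live cell names, tombstones scanned) for one request.

    The tombstone counter is shared by every partition touched by this
    request and starts from zero again on the next call.
    """
    live, tombstones = [], 0
    for cells in partitions:
        for cell in cells:
            if cell.deleted:
                tombstones += 1
                if tombstones > failure_threshold:
                    raise TombstoneLimitExceeded(
                        f"scanned {tombstones} tombstones in one request")
            else:
                live.append(cell.name)
                if len(live) == limit:
                    return live, tombstones
    return live, tombstones


def make_partition(prefix, dead, alive):
    """Build a partition with `dead` tombstones followed by `alive` live cells."""
    cells = [Cell(f"{prefix}-dead-{i}", True) for i in range(dead)]
    cells += [Cell(f"{prefix}-live-{i}", False) for i in range(alive)]
    return cells
```

Under this model, two partitions holding 60,000 tombstones each can both be
read on their own, but one request that spans the two of them crosses the
100k threshold and fails.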

The threshold does not apply to a partition (physical row) simply because
in one request you can fetch data from many partitions (multi get slice).
There was a JIRA about this here:

"Are these tombstones ever "GCed" out of the index?" --> Yes they are,
during compactions of the index column family.

"How frequently?" --> That's the real pain. Indeed you do not have any
control over the compaction tuning of the secondary index CF. As far as I
know, its compaction settings (strategy, min/max thresholds...) are inherited
from those of the base table.

Now, from a quick look at your data model, it seems that you have a
skinny partition pattern. Since you mentioned that the date is updated only
10 times at most, you should not run into the tombstone threshold issue.

On a side note, your usage of a secondary index is not the best one. Indeed,
indexing the update date will lead to a situation where, for one date,
you'll mostly have one or a few matching items (assuming that the update
date resolution is small enough and the update rate is not intense). It is
the high-cardinality scenario to be avoided.
Plus, the query on the index (find all items where last_updated < [now - 30
minutes]) makes things worse since it is not an exact match but an inequality.
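
A small in-memory illustration of the cardinality point, in Python. This is
only a toy model of "value -> matching rows", not how Cassandra stores its
index; the sample rows and the "state" column are invented here purely for
contrast with last_updated.

```python
from collections import defaultdict


def build_index(rows, column):
    """Map each distinct value of `column` to the ids of the rows holding it."""
    index = defaultdict(list)
    for row in rows:
        index[row[column]].append(row["item_id"])
    return index


def avg_rows_per_value(index):
    """Average number of rows found behind a single indexed value."""
    return sum(len(ids) for ids in index.values()) / len(index)


# Hypothetical data: 1000 items, one update every 100 ms, three states.
rows = [
    {
        "item_id": f"item-{i}",
        "last_updated": 1_407_700_000_000 + 100 * i,  # epoch milliseconds
        "state": ("new", "active", "done")[i % 3],
    }
    for i in range(1000)
]

by_date = build_index(rows, "last_updated")  # one entry per item
by_state = build_index(rows, "state")        # three entries in total
```

With these rows the index on last_updated has as many distinct values as
there are items, so each lookup by exact value matches a single row, while
the index on the made-up state column has only three values, each pointing
at hundreds of rows.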

You would be better off creating a manual reverse index to track the
modification date, something like this:

CREATE TABLE last_updated_item (
    minute_bucket bigint, // format YYYYMMDDHHmm (12 digits, too large for a 32-bit int)
    last_update_date timestamp,
    item_id ascii,
    PRIMARY KEY(minute_bucket, last_update_date)
);

The last_update_date column is quite self-explanatory. The minute_bucket
is trickier. The idea is to split time into 30-minute ranges, one per bucket:
00:00 to 00:30 is bucket 1, 00:30 to 01:00 is bucket 2 and so on. For a whole
day, you'd have 48 buckets. We need to put data into buckets to avoid ultra
wide rows since you mentioned that there are 10 items (so 10 updates) /
sec. Of course, 30 minutes is just an example; you can tune it down to a
window of 5 minutes or 1 minute, depending on the insertion rate.
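
One possible way to compute the bucket on the client side, sketched in
Python. The function names are made up for this example, and it assumes
timezone-aware timestamps, buckets expressed in UTC, and a window size that
divides a day evenly (30, 5 and 1 all do). Note that a 12-digit YYYYMMDDHHmm
value does not fit in a 32-bit CQL int, so the sketch assumes the column can
hold a 64-bit integer.

```python
from datetime import datetime, timedelta, timezone


def minute_bucket(ts, window_minutes=30):
    """Return the YYYYMMDDHHmm id of the window that contains `ts`.

    `ts` is floored to the start of its window, so every timestamp in the
    same window maps to the same partition key.
    """
    ts = ts.astimezone(timezone.utc)
    minutes_into_day = ts.hour * 60 + ts.minute
    start = minutes_into_day - minutes_into_day % window_minutes
    return int(f"{ts:%Y%m%d}{start // 60:02d}{start % 60:02d}")


def buckets_between(start, end, window_minutes=30):
    """List the bucket ids of every window overlapping [start, end]."""
    step = timedelta(minutes=window_minutes)
    last = minute_bucket(end, window_minutes)
    ids = []
    current = start
    # Adding one window length always lands in the next bucket, and the
    # YYYYMMDDHHmm ids sort in time order, so a plain comparison is enough.
    while minute_bucket(current, window_minutes) <= last:
        ids.append(minute_bucket(current, window_minutes))
        current += step
    return ids
```

For example, a timestamp of 2014-08-10 21:20:27 UTC falls into bucket
201408102100 with the default 30-minute window and into 201408102120 with a
5-minute window. A job that runs once a minute could use buckets_between to
decide which partitions to read around its cutoff time.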

On Sun, Aug 10, 2014 at 10:02 PM, Ian Rose <> wrote:

> Hi Mark -
> Thanks for the clarification but as I'm not too familiar with the nuts &
> bolts of Cassandra I'm not sure how to apply that info to my current
> situation.  It sounds like this 100k limit is, indeed, a "global" limit as
> opposed to a per-row limit.  Are these tombstones ever "GCed" out of the
> index?  How frequently?  If not, then it seems like *any* index is at risk
> of reaching this tipping point; it's just that indexes on frequently
> updated columns will reach this point faster than indexes on rarely
> updated columns.
> Basically I'm trying to get some kind of sense for what "frequently updated"
> means quantitatively.  As written, the docs make it sound dangerous to
> create an index on a column that is *ever* deleted or updated since there
> is no sense of how frequent is "too frequent".
> Cheers,
> Ian
> On Sun, Aug 10, 2014 at 3:02 PM, Mark Reddy <>
> wrote:
>> Hi Ian,
>> The issue here, which relates to both normal and index column families, is
>> that scanning over a large number of tombstones can cause Cassandra to fall over
>> due to increased GC pressure. This pressure is caused because tombstones
>> will create DeletedColumn objects which consume heap. Also
>> these DeletedColumn objects will have to be serialized and sent back to the
>> coordinator, thus increasing your response times. Take for example a row
>> that does deletes and you query it with a limit of 100. In a worst case
>> scenario you could end up reading say 50k tombstones to reach the 100
>> 'live' column limit, all of which have to be put on heap and then sent over
>> the wire to the coordinator. This would be considered a Cassandra
>> anti-pattern.[1]
>> With that in mind there was a debug warning added to 1.2 to inform the
>> user when they were querying a row with 1000 tombstones [2]. Then in 2.0
>> the action was taken to drop requests reaching 100k tombstones[3] rather
>> than just printing out a warning. This is a safety measure, as it is not
>> advised to perform such a query and is a result of most people 'doing it
>> wrong'.
>> For those people who understand the risk of scanning over large numbers
>> of tombstones there is a configuration option in the cassandra.yaml to
>> increase this threshold, tombstone_failure_threshold.[4]
>> Mark
>> [1]
>> [2]
>> [3]
>> [4]
>> On Sun, Aug 10, 2014 at 7:19 PM, Ian Rose <> wrote:
>>> Hi -
>>> On this page (
>>> the docs state:
>>> Do not use an index [...] On a frequently updated or deleted column
>>> and
>>>> *Problems using an index on a frequently updated or deleted column*
>>> Cassandra stores tombstones in the index until the tombstone limit
>>>> reaches 100K cells. After exceeding the tombstone limit, the query that
>>>> uses the indexed value will fail.
>>> I'm afraid I don't really understand this limit from its (brief)
>>> description.  I also saw this recent thread, but
>>> I'm afraid it didn't help me much...
>>> If I have tens or hundreds of thousands of rows in a keyspace, where
>>> every row has an indexed column that is updated O(10) times during the
>>> lifetime of each row, is that going to cause problems for me?  If that 100k
>>> limit is *per row* then I should be fine but if that 100k limit is *per
>>> keyspace* then I'd definitely exceed it quickly.
>>> In our system, items are created at a rate of ~10/sec.  Each item is
>>> updated ~10 times over the next few minutes (although in rare cases the
>>> number of updates, and the duration, might be several times as long).  Once
>>> the last update is received for an item, we select it from Cassandra,
>>> process the data, then delete the entire row.
>>> The tricky bit is that sometimes (maybe 30-40% of the time) we don't
>>> actually know when the last update has been received so we use a timeout:
>>> if an item hasn't been updated for 30 minutes, then we assume it is done
>>> and should process it as before (select, then delete).  So I am trying to
>>> design a schema that will allow for efficient queries of the form "find me
>>> all items that have not been updated in the past 30 minutes."  We plan to
>>> call this query once a minute.
>>> Here is my tentative schema:
>>> CREATE TABLE items (
>>>   item_id ascii,
>>>   last_updated timestamp,
>>>   item_data list<blob>,
>>>   PRIMARY KEY (item_id)
>>> )
>>> plus an index on last_updated.
>>> So updates to an existing item would just be "lookup by item_id, append
>>> new data to item_data, and set last_updated to now".  And queries to find
>>> items that have timed out would use the index on last_updated: "find all
>>> items where last_updated < [now - 30 minutes]".
>>> Assuming, that is, that the aforementioned 100k tombstone limit won't
>>> bring this index crashing to a halt...
>>> Any clarification on this limit and/or suggestions on a better way to
>>> model/implement this system would be greatly appreciated!
>>> Cheers,
>>> Ian
