incubator-cassandra-user mailing list archives

From Ian Rose <ianr...@fullstory.com>
Subject clarification on 100k tombstone limit in indexes
Date Sun, 10 Aug 2014 18:19:19 GMT
Hi -

On this page
(http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html),
the docs state:

> Do not use an index [...] On a frequently updated or deleted column

and
> *Problems using an index on a frequently updated or deleted column*
> <http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html?scroll=concept_ds_sgh_yzz_zj__upDatIndx>
>
> Cassandra stores tombstones in the index until the tombstone limit reaches
> 100K cells. After exceeding the tombstone limit, the query that uses the
> indexed value will fail.

I'm afraid I don't really understand this limit from its (brief)
description.  I also saw this recent thread
<http://mail-archives.apache.org/mod_mbox/cassandra-user/201403.mbox/%3CCABNXB2Bf4aeoDVpMNOxJ_e7aDez2EuZswMJx=jWfb8=Oyo47kQ@mail.gmail.com%3E>
but it didn't help me much...


*SHORT VERSION*

If I have tens or hundreds of thousands of rows in a keyspace, where every
row has an indexed column that is updated O(10) times during the lifetime
of each row, is that going to cause problems for me?  If that 100k limit is
*per row* then I should be fine, but if it is *per keyspace* then I'd
definitely exceed it quickly.


*FULL EXPLANATION*

In our system, items are created at a rate of ~10/sec.  Each item is
updated ~10 times over the next few minutes (although in rare cases both
the number of updates and the duration might be several times greater).  Once
the last update is received for an item, we select it from Cassandra,
process the data, then delete the entire row.

The tricky bit is that sometimes (maybe 30-40% of the time) we don't
actually know when the last update has been received so we use a timeout:
if an item hasn't been updated for 30 minutes, then we assume it is done
and process it as before (select, then delete).  So I am trying to
design a schema that will allow for efficient queries of the form "find me
all items that have not been updated in the past 30 minutes."  We plan to
call this query once a minute.

Here is my tentative schema:

CREATE TABLE items (
  item_id ascii,
  last_updated timestamp,
  item_data list<blob>,
  PRIMARY KEY (item_id)
)
plus an index on last_updated.
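
For concreteness, I'm assuming the index would be created with the standard
CQL statement, letting Cassandra pick a default index name:

CREATE INDEX ON items (last_updated);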

So updates to an existing item would just be "lookup by item_id, append new
data to item_data, and set last_updated to now".  And queries to find items
that have timed out would use the index on last_updated: "find all items
where last_updated < [now - 30 minutes]".
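
In CQL I'm picturing something like this (the item id, blob, and timestamp
values below are just placeholders):

-- on each update: append the new data and bump the timestamp
UPDATE items
SET item_data = item_data + [0xcafe],
    last_updated = '2014-08-10 18:00:00'
WHERE item_id = 'item-123';

-- once a minute: find items that have gone quiet
SELECT item_id FROM items
WHERE last_updated < '2014-08-10 17:30:00';

(I'm not sure offhand whether that range predicate on an indexed column is
even allowed, or whether it needs ALLOW FILTERING, so treat the SELECT as a
sketch of the intent.)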

Assuming, that is, that the aforementioned 100k tombstone limit won't bring
this index crashing to a halt...

Any clarification on this limit and/or suggestions on a better way to
model/implement this system would be greatly appreciated!

Cheers,
Ian
