cassandra-commits mailing list archives

From "Sam Tunnicliffe (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6517) Loss of secondary index entries if nodetool cleanup called before compaction
Date Mon, 13 Jan 2014 15:33:53 GMT


Sam Tunnicliffe commented on CASSANDRA-6517:

Cleanup is irrelevant here and I was able to repro on a single node. The root cause is the
incorrect use of CompactionManager.NO_GC (aka Long.MIN_VALUE) as a timestamp in PreCompactedRow.merge.
The sequence of events is like so:

* update 0 inserts the row, so indexes the column value
* update 1 deletes the row with a RangeTombstone, which deletes the value from the 2i, but
leaves the original columns in the main cf's memtable
* update 2 re-inserts the row. The main cf memtable still has the old col (which was being
shadowed by the RT), so it calls SIM.update, which inserts the new col into the 2i and tries
to delete the old column (which was already removed by update 1). This results in another
tombstone being written to the 2i memtable, but it has the timestamp of the column from
update 0 (the one whose 2i entry has already been removed), so it has no negative effect.
This is why the 2i query continues to work as expected until we flush/compact.
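
As a rough illustration of why that stale tombstone is harmless, here is a minimal Java sketch of
timestamp reconciliation for a single 2i entry. The class name, the method name and the timestamps
0 and 2 are made up for illustration and are not Cassandra's actual code; the sketch also assumes
that a tombstone wins a timestamp tie.

```java
public class IndexReconcileSketch {
    /**
     * Hypothetical last-write-wins rule for one index entry:
     * the live entry survives only if it is strictly newer than the tombstone.
     * (Assumption: on equal timestamps the tombstone wins.)
     */
    static boolean entrySurvives(long entryTimestamp, long tombstoneTimestamp) {
        return entryTimestamp > tombstoneTimestamp;
    }

    public static void main(String[] args) {
        long update0 = 0L; // illustrative timestamp of the original insert
        long update2 = 2L; // illustrative timestamp of the re-insert

        // update 2 writes the new 2i entry at its own timestamp, while the
        // extra tombstone carries the timestamp of the column from update 0.
        System.out.println(entrySurvives(update2, update0)); // prints "true"
    }
}
```

Because the tombstone is older than the re-inserted entry, the entry stays visible in the 2i memtable.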

When we flush, the RT is written to the sstable. This means that at compaction time, when
we come to process the live column value from the sstable it is checked against the RT and
ends up being removed from the 2i because of the incorrect timestamp passed into deletionInfo.isDeleted
in PCR.merge. This index removal only hits the 2i memtable though, so although it prevents
queries working correctly it only does so until the node is restarted (clearing the 2i memtable).
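
To illustrate the faulty check, the sketch below models a range tombstone as a single deletion
timestamp and compares passing a NO_GC-style sentinel against passing the column's own timestamp.
The names RangeTombstoneCheckSketch and isDeleted and the timestamps 1 and 2 are hypothetical; only
the Long.MIN_VALUE value of NO_GC comes from the description above.

```java
public class RangeTombstoneCheckSketch {
    // Same value as CompactionManager.NO_GC according to the comment above.
    static final long NO_GC = Long.MIN_VALUE;

    /**
     * Hypothetical deletion check: a tombstone marked for delete at
     * markedForDeleteAt shadows anything written at or before that time.
     */
    static boolean isDeleted(long markedForDeleteAt, long timestamp) {
        return timestamp <= markedForDeleteAt;
    }

    public static void main(String[] args) {
        long rangeTombstoneAt = 1L;  // illustrative timestamp of update 1 (the RT)
        long reinsertedColumn = 2L;  // illustrative timestamp of update 2

        // Passing the sentinel: every column looks older than the tombstone,
        // so the live, re-inserted column is wrongly treated as deleted.
        System.out.println(isDeleted(rangeTombstoneAt, NO_GC));            // prints "true"

        // Passing the column's own timestamp: the newer column is not shadowed.
        System.out.println(isDeleted(rangeTombstoneAt, reinsertedColumn)); // prints "false"
    }
}
```

With the sentinel, any tombstone shadows the column regardless of when the column was written, which
matches the index removal seen at compaction time.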

I've attached a bash script which repros the problem & a patch to fix it. The patch includes
a new unit test and all the existing unit tests are still passing (though I didn't check any

> Loss of secondary index entries if nodetool cleanup called before compaction
> ----------------------------------------------------------------------------
>                 Key: CASSANDRA-6517
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: API
>         Environment: Ubuntu 12.0.4 with 8+ GB RAM and 40GB hard disk for data directory.
>            Reporter: Christoph Werres
>            Assignee: Sam Tunnicliffe
>             Fix For: 2.0.5
>         Attachments: 0001-CASSANDRA-6517-Use-column-timestamp-to-check-for-del.patch,
> From time to time we had the feeling of not getting all results that should have been
returned using secondary indexes. Now we tracked down some situations and found out, it happened:
> 1) To primary keys that were already deleted and have been re-created later on
> 2) After our nightly maintenance scripts were running
> We can now reproduce the following scenario:
> - create a row entry with an indexed column included
> - query it and use the secondary index criteria -> Success
> - delete it, query again -> entry gone as expected
> - re-create it with the same key, query it -> success again
> Now use in exactly that sequence
> nodetool cleanup
> nodetool flush
> nodetool compact
> When issuing the query now, we don't get the result using the index. The entry is indeed
available in its table when I just ask for the key. Below is the exact copy-paste output
from CQL when I reproduced the problem with an example entry on one of our tables.
> mwerrch@mstc01401:/opt/cassandra$ current/bin/cqlsh
> Connected to 14-15-Cluster at localhost:9160.
> [cqlsh 4.1.0 | Cassandra 2.0.3 | CQL spec 3.1.1 | Thrift protocol 19.38.0]
> Use HELP for
> cqlsh> use mwerrch;
> cqlsh:mwerrch> desc tables;
> B4Container_Demo
> cqlsh:mwerrch> desc table "B4Container_Demo";
> CREATE TABLE "B4Container_Demo" (
>   key uuid,
>   archived boolean,
>   bytes int,
>   computer int,
>   deleted boolean,
>   description text,
>   doarchive boolean,
>   filename text,
>   first boolean,
>   frames int,
>   ifversion int,
>   imported boolean,
>   jobid int,
>   keepuntil bigint,
>   nextchunk text,
>   node int,
>   recordingkey blob,
>   recstart bigint,
>   recstop bigint,
>   simulationid bigint,
>   systemstart bigint,
>   systemstop bigint,
>   tapelabel bigint,
>   version blob,
>   PRIMARY KEY (key)
> ) WITH
>   bloom_filter_fp_chance=0.010000 AND
>   caching='KEYS_ONLY' AND
>   comment='demo' AND
>   dclocal_read_repair_chance=0.000000 AND
>   gc_grace_seconds=604800 AND
>   index_interval=128 AND
>   read_repair_chance=1.000000 AND
>   replicate_on_write='true' AND
>   populate_io_cache_on_flush='false' AND
>   default_time_to_live=0 AND
>   speculative_retry='NONE' AND
>   memtable_flush_period_in_ms=0 AND
>   compaction={'class': 'SizeTieredCompactionStrategy'} AND
>   compression={'sstable_compression': 'LZ4Compressor'};
> CREATE INDEX mwerrch_Demo_computer ON "B4Container_Demo" (computer);
> CREATE INDEX mwerrch_Demo_node ON "B4Container_Demo" (node);
> CREATE INDEX mwerrch_Demo_recordingkey ON "B4Container_Demo" (recordingkey);
> cqlsh:mwerrch> INSERT INTO "B4Container_Demo" (key,computer,node) VALUES (78c70562-1f98-3971-9c28-2c3d8e09c10f, 50, 50);
> cqlsh:mwerrch> select key,node,computer from "B4Container_Demo" where computer=50;
>  key                                  | node | computer
> --------------------------------------+------+----------
>  78c70562-1f98-3971-9c28-2c3d8e09c10f |   50 |       50
> (1 rows)
> cqlsh:mwerrch> DELETE FROM "B4Container_Demo" WHERE key=78c70562-1f98-3971-9c28-2c3d8e09c10f;
> cqlsh:mwerrch> select key,node,computer from "B4Container_Demo" where computer=50;
> (0 rows)
> cqlsh:mwerrch> INSERT INTO "B4Container_Demo" (key,computer,node) VALUES (78c70562-1f98-3971-9c28-2c3d8e09c10f, 50, 50);
> cqlsh:mwerrch> select key,node,computer from "B4Container_Demo" where computer=50;
>  key                                  | node | computer
> --------------------------------------+------+----------
>  78c70562-1f98-3971-9c28-2c3d8e09c10f |   50 |       50
> (1 rows)
> **********************************
> Now we execute (maybe from a different shell so we don't have to close this session)
from /opt/cassandra/current/bin directory:
> ./nodetool cleanup
> ./nodetool flush
> ./nodetool compact
> Going back to our CQL session the result will no longer be available if queried via the
> *********************************
> cqlsh:mwerrch> select key,node,computer from "B4Container_Demo" where computer=50;
> (0 rows)

This message was sent by Atlassian JIRA
