Querying a table with 5000 tombstones takes 3 minutes to complete!
But querying the same table with the same data pattern and 10,000 live entries takes a fraction of a second to complete!


Details:
1. created the following table:
CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
use test;
CREATE TABLE job_index (   stage text,   "timestamp" text,   PRIMARY KEY (stage, "timestamp")); 

2. inserted 5000 entries to the table:
INSERT INTO job_index (stage, timestamp) VALUES ( 'a', '00000001' );
INSERT INTO job_index (stage, timestamp) VALUES ( 'a', '00000002' );
....
INSERT INTO job_index (stage, timestamp) VALUES ( 'a', '00004999' );
INSERT INTO job_index (stage, timestamp) VALUES ( 'a', '00005000' );
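
The statements elided in step 2 follow a fixed pattern, so they can be generated rather than typed. This is a minimal Python sketch, assuming the keys are the zero-padded 8-digit strings 00000001 through 00005000 that the listed lines suggest. The helper name `insert_statements` is made up for illustration.

```python
def insert_statements(first, last, stage="a"):
    """Return one INSERT per key in [first, last], keys zero-padded to 8 digits.

    Assumption: every key between the listed endpoints is used exactly once.
    """
    return [
        "INSERT INTO job_index (stage, timestamp) VALUES ( '%s', '%08d' );" % (stage, n)
        for n in range(first, last + 1)
    ]


if __name__ == "__main__":
    # Print the 5000 statements of step 2; redirect stdout to a file to keep them.
    print("use test;")
    for statement in insert_statements(1, 5000):
        print(statement)
```

The output can be redirected to a file and run from cqlsh, for example with its SOURCE command.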

3. flushed the table:
nodetool flush test job_index

4. deleted the 5000 entries:
DELETE from job_index WHERE stage ='a' AND timestamp = '00000001' ;
DELETE from job_index WHERE stage ='a' AND timestamp = '00000002' ;
...
DELETE from job_index WHERE stage ='a' AND timestamp = '00004999' ;
DELETE from job_index WHERE stage ='a' AND timestamp = '00005000' ;
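
The deletes in step 4 target each key individually rather than dropping the whole partition in one statement. This is a minimal Python sketch of how the elided statements could be generated, assuming the same zero-padded keys 00000001 through 00005000. The helper name `delete_statements` is made up for illustration.

```python
def delete_statements(first, last, stage="a"):
    """Return one DELETE per key in [first, last], keys zero-padded to 8 digits.

    Assumption: each previously inserted key is deleted exactly once.
    """
    return [
        "DELETE from job_index WHERE stage ='%s' AND timestamp = '%08d' ;" % (stage, n)
        for n in range(first, last + 1)
    ]


if __name__ == "__main__":
    # Print the 5000 statements of step 4; redirect stdout to a file to keep them.
    print("use test;")
    for statement in delete_statements(1, 5000):
        print(statement)
```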

5. flushed the table:
nodetool flush test job_index

6. querying the table takes 3 minutes to complete:
cqlsh:test> SELECT * from job_index limit 20000;
tracing:

While the query was executing, I saw a lot of GC entries in Cassandra's log:
DEBUG [ScheduledTasks:1] 2013-07-01 23:47:59,221 GCInspector.java (line 121) GC for ParNew: 30 ms for 6 collections, 263993608 used; max is 2093809664
DEBUG [ScheduledTasks:1] 2013-07-01 23:48:00,222 GCInspector.java (line 121) GC for ParNew: 29 ms for 6 collections, 186209616 used; max is 2093809664
DEBUG [ScheduledTasks:1] 2013-07-01 23:48:01,223 GCInspector.java (line 121) GC for ParNew: 29 ms for 6 collections, 108731464 used; max is 2093809664

It seems that something very inefficient is happening in managing tombstones.

If I start with a clean table and do the following:
1. insert 5000 entries
2. flush to disk
3. insert 5000 new entries
4. flush to disk
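
The comparison run above can be sketched the same way. The report does not say which keys the second 5000 entries use, so this Python sketch assumes they continue the sequence (00005001 through 00010000), so that the two flushed batches hold 10,000 distinct live cells. The helper name `live_cell_batches` is made up for illustration.

```python
def live_cell_batches(batch_size=5000, batches=2, stage="a"):
    """Return a list of INSERT batches with consecutive, non-overlapping keys.

    Assumption: batch k covers keys (k * batch_size + 1) .. ((k + 1) * batch_size).
    """
    result = []
    for k in range(batches):
        start = k * batch_size + 1
        result.append([
            "INSERT INTO job_index (stage, timestamp) VALUES ( '%s', '%08d' );" % (stage, n)
            for n in range(start, start + batch_size)
        ])
    return result


if __name__ == "__main__":
    for batch in live_cell_batches():
        for statement in batch:
            print(statement)
        # A CQL comment marks where "nodetool flush test job_index" is run by hand.
        print("-- flush to disk here")
```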
Querying job_index for all 10,000 entries takes a fraction of a second to complete:
tracing:

The fact that iterating over 5000 tombstones takes 3 minutes while iterating over 10,000 live cells takes a fraction of a second suggests that something very inefficient is happening in managing tombstones.

I would appreciate it if a developer could look into this.

-M