cassandra-commits mailing list archives

From "Andreas Wederbrand (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CASSANDRA-12756) Duplicate (cql)rows for the same primary key
Date Thu, 06 Oct 2016 20:11:20 GMT
Andreas Wederbrand created CASSANDRA-12756:
----------------------------------------------

             Summary: Duplicate (cql)rows for the same primary key
                 Key: CASSANDRA-12756
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12756
             Project: Cassandra
          Issue Type: Bug
          Components: Compaction, CQL
         Environment: Linux, Cassandra 3.7 (upgraded at one point from 2.?).
            Reporter: Andreas Wederbrand
            Priority: Minor


I observe what look like duplicates when I run CQL queries against a table. It only shows up
for rows written during a couple of hours on a specific date, but it shows for several partitions
and several clustering keys per partition during that time range.

We've loaded data in two ways.
1) through a normal insert
2) through sstableloader, with sstables created using update-statements (to append to the map,
see the sketch below) and an older version of SSTableWriter. During this process several months of data were re-loaded.
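
For illustration, a rough sketch of the kind of update described in 2). The exact statement isn't
part of this report; the values below are hypothetical, borrowed from the query result further down.

-- hypothetical values; appends entries to the maps instead of overwriting the row
UPDATE climate.climate_1510
SET humidity = humidity + {0: 51}, temperature = temperature + {0: 24.38}
WHERE installation_id = 133235
  AND node_id = 35453983
  AND time_bucket = 189
  AND gateway_time = '2016-08-10 20:23:28+0000';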


The table DDL is 
CREATE TABLE climate.climate_1510 (
    installation_id bigint,
    node_id bigint,
    time_bucket int,
    gateway_time timestamp,
    humidity map<int, float>,
    temperature map<int, float>,
    PRIMARY KEY ((installation_id, node_id, time_bucket), gateway_time)
) WITH CLUSTERING ORDER BY (gateway_time DESC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

and the result from the SELECT is
> select * from climate.climate_1510 where installation_id = 133235 and node_id = 35453983 and time_bucket = 189 and gateway_time > '2016-08-10 20:00:00' and gateway_time < '2016-08-10 21:00:00';

 installation_id | node_id  | time_bucket | gateway_time             | humidity | temperature
-----------------+----------+-------------+--------------------------+----------+---------------
          133235 | 35453983 |         189 | 20160810 20:23:28.000000 |  {0: 51} | {0: 24.37891}
          133235 | 35453983 |         189 | 20160810 20:23:28.000000 |  {0: 51} | {0: 24.37891}
          133235 | 35453983 |         189 | 20160810 20:23:28.000000 |  {0: 51} | {0: 24.37891}

I've used Andrew Tolbert's sstable-tools to dump the JSON for this specific time, and this is
what I find.

[133235:35453983:189] Row[info=[ts=1470878906618000] ]: gateway_time=2016-08-10 22:23+0200
| del(humidity)=deletedAt=1470878906617999, localDeletion=1470878906, [humidity[0]=51.0 ts=1470878906618000],
del(temperature)=deletedAt=1470878906617999, localDeletion=1470878906, [temperature[0]=24.378906
ts=1470878906618000]
[133235:35453983:189] Row[info=[ts=-9223372036854775808] del=deletedAt=1470864506441999, localDeletion=1470864506
]: gateway_time=2016-08-10 22:23+0200 | , [humidity[0]=51.0 ts=1470878906618000], , [temperature[0]=24.378906
ts=1470878906618000]
[133235:35453983:189] Row[info=[ts=-9223372036854775808] del=deletedAt=1470868106489000, localDeletion=1470868106
]: gateway_time=2016-08-10 22:23+0200 | 
[133235:35453983:189] Row[info=[ts=-9223372036854775808] del=deletedAt=1470871706530999, localDeletion=1470871706
]: gateway_time=2016-08-10 22:23+0200 | 
[133235:35453983:189] Row[info=[ts=-9223372036854775808] del=deletedAt=1470878906617999, localDeletion=1470878906
]: gateway_time=2016-08-10 22:23+0200 | , [humidity[0]=51.0 ts=1470878906618000], , [temperature[0]=24.378906
ts=1470878906618000]

From my understanding this should be impossible. Even if we have duplicates in the sstables
(which is normal), they should be filtered out before being returned to the client.
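
For what it's worth, a minimal follow-up check (just a sketch, assuming the same partition as
above and that the duplicated timestamp is 2016-08-10 20:23:28 UTC): selecting only the clustering
column should return each gateway_time at most once, so repeated values here show the duplicates
make it all the way through the read path.

> select gateway_time from climate.climate_1510 where installation_id = 133235 and node_id = 35453983 and time_bucket = 189 and gateway_time = '2016-08-10 20:23:28+0000';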

I'm happy to add details to this bug if anything is missing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
