cassandra-commits mailing list archives

From "Alexander Goodrich (JIRA)" <>
Subject [jira] [Created] (CASSANDRA-6918) Compaction Assert: Incorrect Row Data Size
Date Mon, 24 Mar 2014 19:22:46 GMT
Alexander Goodrich created CASSANDRA-6918:

             Summary: Compaction Assert: Incorrect Row Data Size
                 Key: CASSANDRA-6918
             Project: Cassandra
          Issue Type: Bug
          Components: Core
         Environment: 11 node Linux Cassandra 1.2.15 cluster, each node configured as follows:
2P IntelXeon CPU X5660 @ 2.8 GHz (12 cores, 24 threads total)
148 GB RAM
CentOS release 6.4 (Final)
2.6.32-358.11.1.el6.x86_64 #1 SMP Wed May 15 10:48:38 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux
Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)

Node configuration:
Default cassandra.yaml settings for the most part with the following exceptions:
rpc_server_type: hsha

            Reporter: Alexander Goodrich
             Fix For: 1.2.16

I have four tables in a schema with a replication factor of 6. (We previously set this to 3,
but when we added more nodes we figured that more replication would improve read time; this
might have aggravated the issue.)

create table table_value_one (
    id timeuuid PRIMARY KEY,
    value_1 counter
);

create table table_value_two (
    id timeuuid PRIMARY KEY,
    value_2 counter
);

create table table_position_lookup (
    value_1 bigint,
    value_2 bigint,
    id timeuuid,
    PRIMARY KEY (id)
    ) WITH compaction={'class': 'LeveledCompactionStrategy'};

create table sorted_table (
    row_key_index text,
    range bigint,
    sorted_value bigint,
    id timeuuid,
    extra_data list<bigint>,
    PRIMARY KEY ((row_key_index, range), sorted_value, id)
    ) WITH compaction={'class': 'LeveledCompactionStrategy'};

The application creates an object, and stores it in sorted_table based on a value position
- for example, an object has a value_1 of 5500, and a value_2 of 4300.

There are rows which represent indices by which I can sort items based on these values in
descending order. If I wish to see items with the highest # of value_1, I can create an index
that stores them like so:

row_key_index = 'highest_value_1s'

Additionally, we shard each row by bucket ranges, where the range is simply the value_1 or
value_2 rounded down to the nearest 1000. For example, our object above would be found in
row_key_index = 'highest_value_1s' with range 5000, and also in row_key_index =
'highest_value_2s' with range 4000.
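
The bucketing described above could be sketched roughly as follows. This is only an illustration that reads the range as the value rounded down to a multiple of 1000; the names BUCKET_WIDTH, range_bucket and partition_key are hypothetical and do not come from the application.

```python
from typing import Tuple

BUCKET_WIDTH = 1000  # assumed from the "/ 1000" in the description


def range_bucket(value: int, width: int = BUCKET_WIDTH) -> int:
    """Round value down to the nearest multiple of width."""
    return (value // width) * width


def partition_key(index_name: str, value: int) -> Tuple[str, int]:
    """Build the (row_key_index, range) partition key for one index entry."""
    return (index_name, range_bucket(value))


print(partition_key("highest_value_1s", 5500))  # ('highest_value_1s', 5000)
```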

The true values of this object are stored in two counter tables, table_value_one and table_value_two.
The current indexed position is stored in table_position_lookup.

We allow the application to modify value_1 and value_2 in the counter tables indiscriminately.
If we know the current values for these are dirty, we wait a tuned amount of time before we
update the position in the sorted_table index. Each update creates 2 delete operations and
2 write operations on the same table.
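
As a rough sketch of what one such position update might look like, the snippet below builds the two DELETE and two INSERT statements for sorted_table, one pair per index. The statement text is inferred from the schema above, not copied from the application; the function name reposition, the bucket width and the sample values are assumptions, and the table_position_lookup update is left out.

```python
BUCKET_WIDTH = 1000  # assumed bucket width

DELETE_CQL = (
    "DELETE FROM sorted_table "
    "WHERE row_key_index = ? AND range = ? AND sorted_value = ? AND id = ?"
)
INSERT_CQL = (
    "INSERT INTO sorted_table (row_key_index, range, sorted_value, id, extra_data) "
    "VALUES (?, ?, ?, ?, ?)"
)


def reposition(item_id, old, new, extra_data):
    """Return the (statement, params) pairs that move one item in every index.

    old and new map an index name to the item's value in that index.
    """
    statements = []
    for index_name in sorted(new):
        old_value, new_value = old[index_name], new[index_name]
        old_bucket = old_value // BUCKET_WIDTH * BUCKET_WIDTH
        new_bucket = new_value // BUCKET_WIDTH * BUCKET_WIDTH
        # Remove the entry at the stale position, then write it at the new one.
        statements.append(
            (DELETE_CQL, (index_name, old_bucket, old_value, item_id)))
        statements.append(
            (INSERT_CQL, (index_name, new_bucket, new_value, item_id, extra_data)))
    return statements


# Hypothetical values: the item moves from (5500, 4300) to (6100, 4350).
stmts = reposition(
    "hypothetical-timeuuid",
    old={"highest_value_1s": 5500, "highest_value_2s": 4300},
    new={"highest_value_1s": 6100, "highest_value_2s": 4350},
    extra_data=[1, 2, 3],
)
print(len(stmts))  # 4
```

Every delete leaves a tombstone in the old partition, so frequent repositioning adds tombstones to sorted_table until they are compacted away.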

The issue is that when we increase the number of write/delete operations on sorted_table, we
see the following assert in the system log:

ERROR [CompactionExecutor:169] 2014-03-24 08:07:12,871 (line 191) Exception
in thread Thread[CompactionExecutor:169,1,main]
java.lang.AssertionError: incorrect row data size 77705872 written to /var/lib/cassandra/data/loadtest_1/sorted_table/loadtest_1-sorted_table-tmp-ic-165-Data.db;
correct is 77800512
        at org.apache.cassandra.db.compaction.CompactionTask.runWith(
        at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(
        at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(
        at org.apache.cassandra.db.compaction.CompactionManager$
        at java.util.concurrent.Executors$
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
        at java.util.concurrent.ThreadPoolExecutor$

Each object creates approximately 500 unique row keys in sorted_table, and it has an
extra_data field containing approximately 15 different bigint values.

Previously, our application was running Cassandra 1.2.10, and we did not see the assert when
our sorted_table did not have the "extra_data list<bigint>" column. At that time we were also
writing only around 200 unique row keys, containing only the ID column.

We tried both leveled compaction and size-tiered compaction, and both cause the same assert:
compaction fails to happen. After about 100k object writes (creating 55 million rows,
each having potentially as many as 100k items in a single column), we have ~2.4 GB of SSTables
spread across 4840 files and 691 SSTables:

                SSTable count: 691
                SSTables in each level: [685/4, 6, 0, 0, 0, 0, 0, 0, 0]
                Space used (live): 2244774352
                Space used (total): 2251159892
                SSTable Compression Ratio: 0.15101393198465862
                Number of Keys (estimate): 4704128
                Memtable Columns Count: 0
                Memtable Data Size: 0
                Memtable Switch Count: 264
                Read Count: 9204
                Read Latency: NaN ms.
                Write Count: 10151343
                Write Latency: NaN ms.
                Pending Tasks: 0
                Bloom Filter False Positives: 0
                Bloom Filter False Ratio: 0.00000
                Bloom Filter Space Used: 3500496
                Compacted row minimum size: 125
                Compacted row maximum size: 62479625
                Compacted row mean size: 1285302
                Average live cells per slice (last five minutes): 1001.0
                Average tombstones per slice (last five minutes): 8566.5

Some mitigation strategies we have discussed include:
* Breaking sorted_table into multiple column families to spread the # of writes across them.
* Increasing the coalescing time delay.
* Removing extra_data and paying the cost of another table lookup for each item.
* Compressing extra_data into a blob.
* Reducing the replication factor back down to 3 to reduce size pressure on SSTables.
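
For the blob option, one possible encoding is sketched below. It only packs the bigint values into a fixed-width byte string so the list becomes a single column value; it does not compress anything. The function names and the big-endian 8-byte layout are assumptions for illustration.

```python
import struct
from typing import List


def pack_bigints(values: List[int]) -> bytes:
    """Serialize signed 64-bit integers as big-endian, 8 bytes each."""
    return struct.pack(">%dq" % len(values), *values)


def unpack_bigints(blob: bytes) -> List[int]:
    """Inverse of pack_bigints."""
    count, remainder = divmod(len(blob), 8)
    if remainder:
        raise ValueError("blob length is not a multiple of 8")
    return list(struct.unpack(">%dq" % count, blob))


sample = list(range(15))  # stand-in for the ~15 extra_data values
blob = pack_bigints(sample)
print(len(blob))  # 120
```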

Running nodetool repair -pr does not fix the issue. Running nodetool compact manually has
not solved the issue either. The asserts happen pretty frequently across all nodes of the
cluster.