cassandra-commits mailing list archives

From "Jonathan Ellis (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-5210) DB is randomly and undetectably corrupted during high traffic column family flushes
Date Thu, 31 Jan 2013 23:43:12 GMT


Jonathan Ellis commented on CASSANDRA-5210:

This sounds a lot like a custom comparator that doesn't actually impose a total ordering of
its data.
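
To illustrate what "doesn't impose a total ordering" can look like, here is a minimal sketch in plain Java. It does not use Cassandra's comparator API; the class name TotalOrderCheck, the method findViolation, and the BROKEN and SOUND comparators are all hypothetical and exist only for this example. The broken comparator uses the classic subtraction shortcut, which overflows for large operands and so breaks the ordering laws that sorted structures rely on. A sorted lookup driven by such a comparator can miss entries that are actually present.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TotalOrderCheck {
    // Hypothetical broken comparator: x - y overflows for operands far apart.
    static final Comparator<Integer> BROKEN = (x, y) -> x - y;
    // Overflow-safe comparator from the JDK.
    static final Comparator<Integer> SOUND = Integer::compare;

    /**
     * Checks reflexivity, antisymmetry and transitivity over the given samples.
     * Returns a description of the first violation found, or null if none is
     * found. Passing this check does not prove the comparator is correct; it
     * only shows that no violation exists among these samples.
     */
    static <T> String findViolation(Comparator<T> cmp, List<T> samples) {
        for (T a : samples) {
            if (cmp.compare(a, a) != 0) {
                return "not reflexive: " + a;
            }
            for (T b : samples) {
                int ab = Integer.signum(cmp.compare(a, b));
                int ba = Integer.signum(cmp.compare(b, a));
                if (ab != -ba) {
                    return "not antisymmetric: " + a + ", " + b;
                }
                for (T c : samples) {
                    // a <= b and b <= c must imply a <= c
                    if (ab <= 0 && cmp.compare(b, c) <= 0 && cmp.compare(a, c) > 0) {
                        return "not transitive: " + a + ", " + b + ", " + c;
                    }
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Integer> samples =
            Arrays.asList(Integer.MIN_VALUE, -1, 0, 1, Integer.MAX_VALUE);
        System.out.println("broken: " + findViolation(BROKEN, samples));
        System.out.println("sound:  " + findViolation(SOUND, samples));
    }
}
```

Running the same kind of check against a custom column comparator, with sample column names taken from the failing queries, would be one way to confirm or rule out this explanation.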
> DB is randomly and undetectably corrupted during high traffic column family flushes 
> ------------------------------------------------------------------------------------
>                 Key: CASSANDRA-5210
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.8.1, 0.8.2, 0.8.3, 0.8.4, 0.8.5, 0.8.6, 0.8.7, 0.8.8, 0.8.9, 0.8.10,
>                      1.1.0, 1.1.1, 1.1.2, 1.1.3, 1.1.4, 1.1.5, 1.1.6, 1.1.7, 1.1.8, 1.1.9, 1.2.0, 1.2.1
>         Environment: Cassandra 0.8+, OS/X, java version "1.6.0_37" 
>            Reporter: Elden Bishop
> Writes during high-traffic column family flushes corrupt the DB and make slice queries
> return incorrect data.
> Any multi-column write on any version of Cassandra can put the DB in a state where some
> columns cannot be read alongside other columns.
> e.g.
> {{
> // *** for any NON-NULL column (eg. col_a=>AAA)
> cqlsh> SELECT 'col_a' FROM test WHERE KEY='row_a';
>    returns:     'AAA'
> // *** it can disappear when queried alongside another column
> cqlsh> SELECT 'col_a', 'col_b' FROM test WHERE KEY='row_a';
>    returns:      null,   'BBB' // *** col_a is MISSING
> // *** but it depends on the other columns
> cqlsh> SELECT 'col_a', 'col_b', 'col_c' FROM test WHERE KEY='row_a';
>    returns:     'AAA',   'BBB',   'CCC' // *** col_a is BACK
> }}
> Once in this state, the database is corrupt and returns essentially random data depending
> on which columns you query. Single-column queries always return correct results, so there is
> no way to verify the data. No errors are logged during corruption, and it is impossible to
> detect without querying all combinations of all columns.
> To reproduce:
> 1. Unzip a distribution of Cassandra and create a test.test column family.
> 2. In a loop alternate between updating either row 'a' or a random row.
>    Write a random value to four random columns (out of 10000). Keep track
>    of all columns set in row 'a'.
> 3. On each pass through the loop, query four random columns (out of 10000) from row 'a'.
>    If a column that is known to be set is null, print out the columns that were requested
>    during the query.
> 4. The DB is now corrupt and will return the column if queried by itself but will return
>    null if queried alongside the columns that triggered the error. This is a permanent condition.
> Observations: This bug only manifests directly after a high-traffic column family flush
> appears in the log. This is a correlation based on simply watching the log. There are no
> errors or warnings of any kind.
> Workaround: Any multi-column read is potentially invalid and corruption is virtually
> undetectable. The only workaround is never writing or reading more than a single column in
> a query.
> I have a simple Groovy script that can trigger the error. I have verified the behavior
> on Cassandra versions as old as 0.8.1.
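
The reporter's Groovy script is not attached to this message. As a rough sketch of the reproduction steps quoted above, the loop could look like the following Java. Everything here is an assumption made for illustration: ColumnStore is a hypothetical stand-in interface, not a real Cassandra client API; InMemoryStore is a trivial correct implementation so the harness runs on its own; and "alternate" is interpreted as strictly switching between row 'a' and a random row on each pass. To test a real cluster, ColumnStore would have to be implemented against an actual client.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

final class FlushCorruptionRepro {
    /** Hypothetical stand-in for a Cassandra client; not a real driver API. */
    interface ColumnStore {
        void write(String rowKey, Map<String, String> columns);
        Map<String, String> read(String rowKey, List<String> columnNames);
    }

    /** Trivial correct store, used only so the sketch is runnable. */
    static final class InMemoryStore implements ColumnStore {
        private final Map<String, Map<String, String>> rows = new HashMap<>();

        public void write(String rowKey, Map<String, String> columns) {
            rows.computeIfAbsent(rowKey, k -> new HashMap<>()).putAll(columns);
        }

        public Map<String, String> read(String rowKey, List<String> columnNames) {
            Map<String, String> row = rows.getOrDefault(rowKey, new HashMap<>());
            Map<String, String> result = new HashMap<>();
            for (String name : columnNames) {
                if (row.containsKey(name)) {
                    result.put(name, row.get(name));
                }
            }
            return result;
        }
    }

    /** Picks four distinct column names out of 10000 possible ones. */
    static List<String> randomColumns(Random rnd) {
        Set<String> names = new LinkedHashSet<>();
        while (names.size() < 4) {
            names.add("col" + rnd.nextInt(10000));
        }
        return new ArrayList<>(names);
    }

    /** Runs the write/read loop and returns the number of failed reads. */
    static int run(ColumnStore store, int iterations, long seed) {
        Random rnd = new Random(seed);
        Set<String> knownInRowA = new HashSet<>();
        int failures = 0;
        for (int i = 0; i < iterations; i++) {
            // Step 2: alternate between row 'a' and a random row.
            String row = (i % 2 == 0) ? "a" : "row" + rnd.nextInt(1_000_000);
            Map<String, String> columns = new HashMap<>();
            for (String name : randomColumns(rnd)) {
                columns.put(name, Long.toString(rnd.nextLong()));
            }
            store.write(row, columns);
            if (row.equals("a")) {
                knownInRowA.addAll(columns.keySet());
            }
            // Step 3: query four random columns from row 'a'.
            List<String> requested = randomColumns(rnd);
            Map<String, String> result = store.read("a", requested);
            for (String name : requested) {
                if (knownInRowA.contains(name) && result.get(name) == null) {
                    failures++;
                    System.out.println("missing " + name + " when querying " + requested);
                    break;
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("failures: " + run(new InMemoryStore(), 100_000, 42L));
    }
}
```

Against the in-memory store the loop should never report a missing column; a non-zero count from a store backed by a real cluster would correspond to the condition described in step 4.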

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators