cassandra-commits mailing list archives

From "Elden Bishop (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CASSANDRA-5210) DB is randomly and undetectably corrupted during high traffic column family flushes
Date Thu, 31 Jan 2013 21:17:12 GMT
Elden Bishop created CASSANDRA-5210:
---------------------------------------

             Summary: DB is randomly and undetectably corrupted during high traffic column family flushes
                 Key: CASSANDRA-5210
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5210
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.2.1, 1.2.0, 1.1.9, 1.1.8, 1.1.7, 1.1.6, 1.1.5, 1.1.4, 1.1.3, 1.1.2, 1.1.1, 1.1.0, 0.8.10, 0.8.9, 0.8.8, 0.8.7, 0.8.6, 0.8.5, 0.8.4, 0.8.3, 0.8.2, 0.8.1
         Environment: Cassandra 0.8+, OS/X, java version "1.6.0_37" 
            Reporter: Elden Bishop


Writes during high traffic column family flushes corrupt the DB and make slice queries return
incorrect data.

Any multi-column write on any version of Cassandra can put the DB in a state where some columns
cannot be read alongside other columns.

For example:

{{
// *** for any NON-NULL column (e.g. col_a=>AAA)
cqlsh> SELECT 'col_a' FROM test WHERE KEY='row_a';
   returns:     'AAA'

// *** it can disappear when queried alongside another column
cqlsh> SELECT 'col_a', 'col_b' FROM test WHERE KEY='row_a';
   returns:      null,   'BBB' // *** col_a is MISSING

// *** but it depends on the other columns
cqlsh> SELECT 'col_a', 'col_b', 'col_c' FROM test WHERE KEY='row_a';
   returns:     'AAA',   'BBB',   'CCC' // *** col_a is BACK
}}

Once in this state, the database is corrupt and effectively returns random data depending
on which columns you query. Single-column queries always return correct results, so there is
no way to verify the data. No errors are logged during corruption, and it is impossible to
detect without querying all combinations of all columns.
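
The symptom described above (a column that reads correctly alone but comes back null in a
multi-column query) suggests a per-query consistency check. The following is a minimal Python
sketch of that check, not the reporter's script. `read_columns` is a hypothetical callable
standing in for whatever client issues the query; it is assumed to take a row key and a list
of column names and to return a dict of column name to value, with None for missing columns.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical reader signature (an assumption, not a real client API):
# (row_key, column_names) -> {column_name: value or None}
Reader = Callable[[str, List[str]], Dict[str, Optional[str]]]


def find_inconsistent_columns(read_columns: Reader, row_key: str,
                              columns: Iterable[str]) -> List[str]:
    """Return the columns whose value in one multi-column read differs
    from the value returned when the same column is read by itself."""
    columns = list(columns)
    # One query for all requested columns together.
    together = read_columns(row_key, columns)
    mismatched = []
    for col in columns:
        # One query per column; the report says these are always correct.
        alone = read_columns(row_key, [col]).get(col)
        if together.get(col) != alone:
            mismatched.append(col)
    return mismatched
```

This only checks the one column combination it is given, so it does not remove the need to
try many combinations; it just makes each individual comparison mechanical.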

To reproduce:

1. Unzip a distribution of Cassandra and create a test.test column family.
2. In a loop, alternate between updating either row 'a' or a random row.
   Write a random value to four random columns (out of 10000). Keep track
   of all columns set in row 'a'.
3. On each pass through the loop, query four random columns (out of 10000) from row 'a'.
   If a column that is known to be set is null, print out the columns that were requested
   during the query.
4. The DB is now corrupt and will return the column if queried by itself, but will return
   null if queried alongside the columns that triggered the error. This is a permanent
   condition.
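
The loop in steps 2 and 3 can be sketched roughly as follows. This is an illustrative Python
sketch under stated assumptions, not the Groovy script mentioned in this report:
`write_columns` and `read_columns` are hypothetical callables wrapping whatever client is
used, the row and column naming (`row_N`, `col_N`) is invented for the example, and
"alternate" is interpreted as strict alternation between row 'a' and a random row.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical client wrappers (assumptions, not a real driver API).
Writer = Callable[[str, Dict[str, str]], None]
Reader = Callable[[str, List[str]], Dict[str, Optional[str]]]

NUM_COLUMNS = 10000   # size of the column pool, from step 2
COLUMNS_PER_OP = 4    # columns written and queried per pass


def run_repro(write_columns: Writer, read_columns: Reader, iterations: int,
              num_columns: int = NUM_COLUMNS,
              seed: int = 0) -> List[Tuple[List[str], List[str]]]:
    """Run the write/read loop and return (queried, missing) pairs for
    every query in which a column known to be set came back as None."""
    rng = random.Random(seed)
    known = set()     # columns known to be set in row 'a'
    failures = []
    for i in range(iterations):
        # Step 2: alternate between row 'a' and a random row.
        row = "a" if i % 2 == 0 else "row_%d" % rng.randrange(num_columns)
        cols = ["col_%d" % c
                for c in rng.sample(range(num_columns), COLUMNS_PER_OP)]
        write_columns(row, {c: "val_%d" % rng.randrange(10 ** 9) for c in cols})
        if row == "a":
            known.update(cols)
        # Step 3: query four random columns from row 'a'.
        query = ["col_%d" % c
                 for c in rng.sample(range(num_columns), COLUMNS_PER_OP)]
        result = read_columns("a", query)
        missing = [c for c in query if c in known and result.get(c) is None]
        if missing:
            failures.append((query, missing))
    return failures
```

Each returned pair records the exact column set that was queried, which is what step 4 needs
in order to re-run the failing combination later and confirm that the condition persists.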


Observations: This bug only manifests directly after a high-traffic column family flush appears
in the log. This correlation is based only on watching the log. There are no errors or
warnings of any kind.

Workaround: Any multi-column read is potentially invalid, and corruption is virtually undetectable.
The only workaround is to never write or read more than a single column in a query.

I have a simple Groovy script that can trigger the error. I have verified the behavior on
Cassandra versions as old as 0.8.1.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
