cassandra-commits mailing list archives

From "Vassil Hristov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-4274) Cassandra cluster becomes very slow and loses data after node failure
Date Wed, 23 May 2012 12:07:40 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vassil Hristov updated CASSANDRA-4274:
--------------------------------------

    Attachment: catalina.2012-05-23.log
                system.192.168.1.7.log
                output.192.168.1.7.log
                system.192.168.1.8.log
                output.192.168.1.8.log
    
> Cassandra cluster becomes very slow and loses data after node failure
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-4274
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4274
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.0.8
>         Environment: Linux version 2.6.32-5-amd64 (Debian 2.6.32-41) (ben@decadent.org.uk)
(gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Mon Jan 16 16:22:28 UTC 2012
> Debian GNU/Linux 6.0
> Cassandra 1.0.8 Debian package
>            Reporter: Vassil Hristov
>         Attachments: catalina.2012-05-23.log, output.192.168.1.7.log, output.192.168.1.8.log,
system.192.168.1.7.log, system.192.168.1.8.log
>
>
> Hi,
> in a nutshell: today we experienced a problem with one of our clusters. Our application
became very slow, and the cause turned out to be Cassandra. Additionally, some data was
not persisted properly. Rebooting one of the nodes fixed the problem in the sense that the
application is responsive again and data is written properly, but the lost data was not recovered.
> Now some more details.
> The setup: we have 2 nodes running Cassandra 1.0.8. In our Java application, we use
Hector to connect to the nodes. We store some log data in Cassandra. The relevant method looks
like this:
> {{  storeMessage(mutator, key, message, ttl);
>   storeMessageInIndex(mutator, key, message, ttl);}}
> In the first method, the entire message is stored in the column family cfMainData under
the provided key, and in the second we maintain a manual index, which is stored in a different
column family (cfDateOrderedMessages) under the same key.
> The problem: our support team reported that certain operations took extremely long (200+ seconds,
compared to the usual <1 second). According to {{nodetool ring}}, both nodes were up and
running. When we checked the data, some of it was not accessible. What's really odd, though,
is that the index was maintained properly while the main data was missing. That is, the index
would hold a key X, but cfMainData[X] would return no results.
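
The inconsistency described above (an index entry with no matching main row) can be checked for mechanically. The following is only a minimal sketch: plain Java collections stand in for the two column families, and the class and method names are hypothetical. It is not Hector or Cassandra API; in the real application the key set and the row lookup would come from queries against cfDateOrderedMessages and cfMainData.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: in-memory stand-ins for the two column families.
final class OrphanedIndexCheck {

    /**
     * Returns, in sorted order, every key that is present in the index
     * but has no corresponding row in the main data.
     */
    static List<String> findOrphanedKeys(Set<String> indexKeys,
                                         Map<String, String> mainData) {
        List<String> orphans = new ArrayList<>();
        for (String key : indexKeys) {
            if (!mainData.containsKey(key)) {
                orphans.add(key); // index knows the key, main data does not
            }
        }
        Collections.sort(orphans);
        return orphans;
    }
}
```

Running such a check after an incident would at least give the list of keys whose main data was lost, even though it cannot recover the data itself.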
> After the restart of one of the nodes (192.168.1.7 for the log reference), everything
went back to normal and now all is working correctly.
> I am well aware that it's very likely that you won't be able to reproduce the problem
(we cannot either). However, maybe you'll figure out why the 'broken' node wasn't marked as
such. The behaviour I would have expected is that all writes would fail, since quorum cannot
be reached. The result would again be lost data, but it would be a more consistent behaviour.
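
The expectation about quorum can be made concrete with a little arithmetic. The report does not state the replication factor or the consistency level, so the sketch below assumes a replication factor of 2 and QUORUM writes; the class and method names are hypothetical. Cassandra computes a quorum as floor(RF / 2) + 1.

```java
// Hypothetical helper illustrating quorum arithmetic; not Cassandra or Hector API.
final class QuorumMath {

    /** Number of replicas that must acknowledge a QUORUM operation. */
    static int quorumSize(int replicationFactor) {
        if (replicationFactor < 1) {
            throw new IllegalArgumentException("replication factor must be >= 1");
        }
        return replicationFactor / 2 + 1; // integer division gives the floor
    }

    /** True if enough replicas are alive to satisfy QUORUM. */
    static boolean quorumReachable(int replicationFactor, int liveReplicas) {
        return liveReplicas >= quorumSize(replicationFactor);
    }

    public static void main(String[] args) {
        System.out.println(quorumSize(2));         // 2
        System.out.println(quorumReachable(2, 1)); // false
        System.out.println(quorumReachable(3, 2)); // true
    }
}
```

Under these assumptions, a two-node cluster with one unresponsive replica cannot satisfy QUORUM, which matches the expectation that writes should fail rather than be partially applied. That only holds if the coordinator actually treats the replica as down or the request times out.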



        
