cassandra-commits mailing list archives

From "Edward Sargisson (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-4583) Some nodes forget schema when 1 node fails
Date Wed, 29 Aug 2012 16:18:07 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-4583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edward Sargisson updated CASSANDRA-4583:
----------------------------------------

    Description: 
At present we do not have a complete reproduction for this defect, but we are raising it as requested by Aaron Morton. We will update the issue as we find out more, and will run any additional logging or tests requested if we can.

We have seen two failures ascribed to this defect, and Peter Schuller describes an additional failure on the cassandra-user mailing list (2012-08-28).

Reproduction steps as currently known:
1. Set up a cluster with 6 nodes (call them #1 through #6).
2. Have #5 fail completely. In one failure the node was stopped to replace the battery in the hard-disk cache; in the second, hardware monitoring recorded a problem, CPU usage was climbing without explanation, and the server console was frozen, so the machine was restarted.
3. Bring #5 back online.
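For anyone trying to reproduce this locally, the scenario above can be approximated with ccm (Cassandra Cluster Manager). This is a sketch added for illustration, not something the reporter used; the cluster name is made up, and the commands are printed as a dry run rather than executed.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps using ccm.
# Cluster name is a placeholder; version and node count mirror the report.
cluster="schema-forget-repro"

echo "ccm create $cluster -v 1.1.2 -n 6 -s"   # step 1: six-node cluster, started
echo "ccm node5 stop --not-gently"            # step 2: hard-fail node #5
echo "ccm node5 start"                        # step 3: bring #5 back
```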

Expected behaviour:
* #5 should rejoin the ring.

Actual behaviour (based on the incident we saw yesterday):
* #5 didn't rejoin the ring.
* We stopped all nodes and started them one by one.
* Nodes #2, #4 and #6 had forgotten most of their column families: they still had the keyspace, but with only one column family instead of the usual 9 or so.
* We ran nodetool resetlocalschema on #2, #4 and #6.
* We ran nodetool repair -pr on #2, #4, #5 and #6.
* On #2, nodetool repair appeared to hang: it logged no messages for 10+ minutes, and nodetool compactionstats and nodetool netstats showed no activity.
* Rerunning nodetool repair -pr cleared the problem and ran to completion.
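The recovery steps above amount to the following command sequence. This is a sketch with placeholder hostnames (node2 through node6 are not in the original report), printed as a dry run. resetlocalschema drops a node's local schema and re-pulls it from the rest of the cluster; repair -pr restricts repair to each node's primary range so ranges aren't repaired repeatedly across nodes.

```shell
#!/bin/sh
# Dry-run sketch of the recovery sequence from the report.
# Hostnames are placeholders; commands are printed, not executed.
reset_nodes="node2 node4 node6"        # nodes that had forgotten column families
repair_nodes="node2 node4 node5 node6" # nodes repaired afterwards

for node in $reset_nodes; do
    echo "nodetool -h $node resetlocalschema"
done

for node in $repair_nodes; do
    echo "nodetool -h $node repair -pr"
done
```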



  was:
At present we do not have a complete reproduction for this defect, but we are raising it as requested by Aaron Morton. We will update the issue as we find out more, and will run any additional logging or tests requested if we can.

We have seen two failures ascribed to this defect, and Peter Schuller describes an additional failure on the cassandra-user mailing list (2012-08-28).

Reproduction steps as currently known:
1. Set up a cluster with 6 nodes (call them #1 through #6).
2. Have #5 fail completely. In one failure the node was stopped to replace the battery in the hard-disk cache; in the second, hardware monitoring recorded a problem, CPU usage was climbing without explanation, and the server console was frozen, so the machine was restarted.
3. Bring #5 back online.

Expected behaviour:
* #5 should rejoin the ring.

Actual behaviour (based on the incident we saw yesterday):
* #5 didn't rejoin the ring.
* We stopped all nodes and started them one by one.
* Nodes #2, #4 and #6 had forgotten most of their column families: they still had the keyspace, but with only one column family instead of the usual 9 or so.
* We ran nodetool resetlocalschema on #2, #4 and #6.
* We ran nodetool repair -pr on #2, #4, #5 and #6.
* On one of these nodes, nodetool repair appeared to hang: it logged no messages for 10+ minutes, and nodetool compactionstats and nodetool netstats showed no activity.
* Rerunning nodetool repair -pr cleared the problem and ran to completion.



    
> Some nodes forget schema when 1 node fails
> ------------------------------------------
>
>                 Key: CASSANDRA-4583
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4583
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.2
>         Environment: CentOS release 6.3 (Final)
>            Reporter: Edward Sargisson
>
> At present we do not have a complete reproduction for this defect, but we are raising it as requested by Aaron Morton. We will update the issue as we find out more, and will run any additional logging or tests requested if we can.
> We have seen two failures ascribed to this defect, and Peter Schuller describes an additional failure on the cassandra-user mailing list (2012-08-28).
> Reproduction steps as currently known:
> 1. Set up a cluster with 6 nodes (call them #1 through #6).
> 2. Have #5 fail completely. In one failure the node was stopped to replace the battery in the hard-disk cache; in the second, hardware monitoring recorded a problem, CPU usage was climbing without explanation, and the server console was frozen, so the machine was restarted.
> 3. Bring #5 back online.
> Expected behaviour:
> * #5 should rejoin the ring.
> Actual behaviour (based on the incident we saw yesterday):
> * #5 didn't rejoin the ring.
> * We stopped all nodes and started them one by one.
> * Nodes #2, #4 and #6 had forgotten most of their column families: they still had the keyspace, but with only one column family instead of the usual 9 or so.
> * We ran nodetool resetlocalschema on #2, #4 and #6.
> * We ran nodetool repair -pr on #2, #4, #5 and #6.
> * On #2, nodetool repair appeared to hang: it logged no messages for 10+ minutes, and nodetool compactionstats and nodetool netstats showed no activity.
> * Rerunning nodetool repair -pr cleared the problem and ran to completion.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
