cassandra-user mailing list archives

From aaron morton
Subject Re: Problems recovering a dead node
Date Wed, 04 May 2011 09:02:31 GMT
Certainly sounds a bit sick. 

The first error looks like it happens when the index file points to the wrong place in the
data file for the SSTable. The second one happens when the index file is corrupted. These should
be problems nodetool scrub can fix.

The extra disk space may be dead space left over from compaction or from a failed streaming
session. You can check how much Cassandra considers to be live (in use) space using nodetool
cfstats. This will also tell you how many SSTables are live. Having a lot of dead SSTables is
not necessarily a bad thing.
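
If it helps to compare live and total space across all column families, here is a minimal
Python sketch. It assumes cfstats prints "SSTable count", "Space used (live)" and
"Space used (total)" lines per column family; check that against what your version actually
prints. The sample text, the keyspace and column family names, and the summarize helper are
all made up for illustration.

```python
import re

# Hypothetical sample shaped like `nodetool cfstats` output.
# The names and numbers are invented for illustration only.
SAMPLE = """\
Keyspace: ks1
    Column Family: cf_a
    SSTable count: 3
    Space used (live): 1000
    Space used (total): 4000
    Column Family: cf_b
    SSTable count: 2
    Space used (live): 500
    Space used (total): 500
"""

# Assumed line format: "<label>: <integer>", with optional leading whitespace.
LINE = re.compile(r"^\s*(SSTable count|Space used \((?:live|total)\)):\s*(\d+)\s*$")


def summarize(text):
    """Sum SSTable counts and space figures over every column family in the text."""
    totals = {"SSTable count": 0, "Space used (live)": 0, "Space used (total)": 0}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            totals[match.group(1)] += int(match.group(2))
    # Space held by SSTables that are on disk but no longer live.
    totals["dead"] = totals["Space used (total)"] - totals["Space used (live)"]
    return totals


summary = summarize(SAMPLE)
print(summary)
```

To use it on a real node, pass the captured text of nodetool cfstats to summarize instead of
SAMPLE. A large "dead" figure would point at space that should be reclaimed once the obsolete
SSTables are deleted.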

What are the pending tasks? What is nodetool tpstats showing? And what does nodetool ring
show from one of the other nodes?

I'm assuming there are no errors in the logs on the node. What are the most recent INFO messages?

Hope that helps. 

Aaron Morton
Freelance Cassandra Developer

On 4 May 2011, at 17:54, Héctor Izquierdo Seliva wrote:

> Hi Aaron
> It has no data files whatsoever. The upgrade path is 0.7.4 -> 0.7.5. It
> turns out the initial problem was the sw raid failing silently because
> of another faulty disk.
> Now that the storage is working, I brought up the node again, same IP,
> same token and tried doing nodetool repair. 
> All adjacent nodes have finished the streaming session, and now the node
> has a total of 248 GB of data. Is this normal when the load per node is
> about 18GB? 
> Also there are 1245 pending tasks. It's been compacting or rebuilding
> sstables for the last 8 hours non stop. There are 2057 sstables in the
> data folder.
> Should I have done things differently or is this the normal behaviour?
> Thanks!
> On Wed, 04-05-2011 at 07:54 +1200, aaron morton wrote:
>> When you say "it's clean" does that mean the node has no data files?
>> After you replaced the disk what process did you use to recover?
>> Also what version are you running and what's the recent upgrade history?
>> Cheers
>> Aaron
>> On 3 May 2011, at 23:09, Héctor Izquierdo Seliva wrote:
>>> Hi everyone. One of the nodes in my 6 node cluster died with disk
>>> failures. I have replaced the disks, and it's clean. It has the same
>>> configuration (same ip, same token).
>>> When I try to restart the node it starts to throw mmap underflow
>>> exceptions till it closes again.
>>> I tried setting io to standard, but it still fails. It gives errors
>>> about two decorated keys being different, and the EOFException.
>>> Here is an excerpt of the log
>>> I can provide more info if needed. I'm at a loss here so any help is
>>> appreciated.
>>> Thanks all for your time
>>> Héctor Izquierdo
