incubator-cassandra-user mailing list archives

From Maxim Potekhin <>
Subject Re: Repair taking a long, long time
Date Tue, 19 Jul 2011 21:18:37 GMT
Thanks Edward. I'm told by our IT that the switch connecting the nodes
is pretty fast.
Seriously, in my house I copy complete DVD images from my bedroom to
the living room downstairs via WiFi, and a dozen GB does not seem like a
problem, even on dirt-cheap hardware (Patriot Box Office).

I also have just _one_ major column family, but caveat emptor: it has 8
indexes attached (and there will be more). There is one accounting CF,
which is small and can't possibly make a difference.

By contrast, compaction (as in nodetool) performs quite well on this
cluster. I'm starting to suspect some sort of malfunction.

I looked at the system log during the "repair": there is some compaction
agent doing work that I'm not sure makes sense (and I didn't call for it).
Disk utilization all of a sudden goes up to 40% per Ganglia and stays
there. This is pretty silly considering the cluster is IDLE and we have
SSDs. No external writes, no reads. There are occasional GC stoppages,
but these I can live with.

This repair debacle has now happened for the 2nd time in a row. Cr@p.
I need to go to production soon and that doesn't look good at all. If I
can't manage a system that simple (and/or get help on this list), I may
have to cut my losses, i.e. stay with Oracle.



On 7/19/2011 12:16 PM, Edward Capriolo wrote:
> Well, most SSDs are pretty fast. There is one more thing to consider.
> If Cassandra determines nodes are out of sync it has to transfer data
> across the network. If that is the case you have to look at 'nodetool
> streams' and determine how much data is being transferred between
> nodes. There are some open tickets where with larger tables repair is
> streaming more than it needs to. But even if the transfers are only
> 10% of your 200GB, transferring 20 GB is not trivial.
> If you have multiple keyspaces and column families, repairing one at a
> time might make the process more manageable.
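
For a rough sense of scale, the arithmetic in the quoted message can be
sketched as below. The 200 GB total and the 10% out-of-sync fraction come
from the quoted text; the link speeds are assumptions chosen only for
illustration (the actual switch speed is not stated in this thread), and
the function names are hypothetical. The result is an idealized lower
bound on wire time and ignores disk, validation and compaction overhead.

```python
def streamed_gb(total_gb: float, out_of_sync_fraction: float) -> float:
    """Data volume (GB) that repair would need to stream between nodes."""
    return total_gb * out_of_sync_fraction


def transfer_seconds(data_gb: float, link_mbit_per_s: float) -> float:
    """Ideal time to move data_gb (decimal GB) over a link of the given speed.

    Assumes the link is fully utilized, which real repairs rarely achieve.
    """
    bits = data_gb * 8 * 1000**3
    return bits / (link_mbit_per_s * 1_000_000)


if __name__ == "__main__":
    to_stream = streamed_gb(200, 0.10)  # figures from the quoted message
    # Link speeds below are assumed, not taken from the thread.
    for label, mbit in [("100 Mbit/s", 100), ("1 Gbit/s", 1000)]:
        minutes = transfer_seconds(to_stream, mbit) / 60
        print(f"{label}: {to_stream:.0f} GB in about {minutes:.1f} min")
```

If the observed repair time is far beyond such an estimate, the network
transfer alone is unlikely to be the whole explanation.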
