incubator-cassandra-user mailing list archives

From Ran Tavory <ran...@gmail.com>
Subject Nodes getting slowed down after a few days of smooth operation
Date Mon, 11 Oct 2010 14:13:30 GMT
In my production cluster I've been seeing the following pattern.
When a node comes up it operates smoothly for a few days, but then it
starts to show excessive CPU usage, I see GC activity (possibly excessive
as well, I'm not sure), and sometimes the node drops out of the ring due to
unresponsiveness. If I leave things untouched for a few more days,
eventually all nodes in the cluster end up serving slow reads and clients
start timing out. A node restart solves the problem for a few more days.
So far the only trigger I can identify for this behavior is "number of days
since the last node restart", usually 4-5 days. One easy solution is to
restart nodes every couple of days, but that's lame...

My current theory is this:
After a node gets restarted it compacts the sstable files on disk. I'm not
sure whether compactions always take place after a restart, or whether
they're only minor compactions (I'm a little fuzzy here), but my story
works best if (major) compactions always run at server restart.
Then, as the node operates, enough sstable files pile up over a few days
that the node's filesystem cache fills up. This doesn't happen on all the
nodes, just in the one DC where the hosts have 16G RAM, as opposed to the
other DC where they have 32G RAM.
When the filesystem cache is full, bad things start to happen. Not right
away; it may take another 2-3 days of high CPU and a more or less full
fscache, but eventually I start seeing more GC activity, tpstats shows a
long list of pending tasks (usually in the row-read-stage), nodes get
kicked out of and back into the ring because they're too busy to respond,
and clients start to time out or just see very bad response times.
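
To test this theory I plan to watch the number of live sstable files and
how much memory the page cache is actually using on the 16G hosts; a rough
sketch (the data directory path is just an example, it's whatever
DataFileDirectory points to in storage-conf):

$ ls /var/lib/cassandra/data/MyKeyspace/*-Data.db | wc -l   # live sstables in the keyspace
$ free -m                                                   # the "cached" column is the filesystem cache
$ grep -E 'MemFree|Cached' /proc/meminfo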

So I'm thinking what's the right thing to do?
If I increase the memtable size, would that help? (It'll take me a week to
experiment...) I'm not sure it would actually help, because at the end of
the day the filesystem cache will still fill up, only this time with a
smaller number of larger files.
Actually, I'm only speculating that it's the filesystem cache getting full;
I'm not sure that's the root cause, so what else could it be? Maybe it's
the sheer number of sstable files? At first I suspected a memory leak, but
I haven't seen any evidence of that.
If I run nodetool compact from a crontab every day instead of restarting,
would that help? (That'll be lame too, I know... see the sketch below.)
The documentation at http://wiki.apache.org/cassandra/MemtableSSTable states
that there's a minimum number N of sstable files required before a
compaction runs, default 4. Should I try to decrease this number to maybe 2?
(Where is this setting anyway? I don't think it's in storage-conf.xml, is it
code only?)
Shooting in the dark, perhaps decreasing GCGraceSeconds from 10 days to 1
day will have a positive effect?
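
For reference, the crontab entry I have in mind is something like this
(the schedule, nodetool path and log file are just illustrative):

0 3 * * * /opt/cassandra/bin/nodetool -port 9004 -host localhost compact >> /var/log/cassandra/nightly-compact.log 2>&1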



-------
Number of hosts: 6 in two DCs, 3 in each. RF=2, one copy of data in each DC.
Current version is 0.6.5, but it behaved the same way with 0.6.2 and
possibly even earlier.

$ nodetool -port 9004 -host cass1 ring
Address          Status   Load       Range                                      Ring
                                     170141183460469231731687303715884105727
192.168.252.88   Up       10.07 GB   28356863910078205288614550619314017621    |<--|
192.168.254.57   Up       11.2 GB    56713727820156410577229101238628035242    |   ^
192.168.252.124  Up       10.03 GB   85070591730234615865843651857942052863    v   |
192.168.254.58   Up       11.26 GB   113427455640312821154458202477256070484   |   ^
192.168.252.125  Up       10.1 GB    141784319550391026443072753096570088105   v   |
192.168.254.59   Up       11.41 GB   170141183460469231731687303715884105727   |-->|

$ cat /proc/sys/vm/swappiness
0

$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

JVM_OPTS=" \
        -ea \
        -Xms4G \
        -Xmx6G \
        -XX:+UseParNewGC \
        -XX:+UseConcMarkSweepGC \
        -XX:+CMSParallelRemarkEnabled \
        -XX:SurvivorRatio=8 \
        -XX:MaxTenuringThreshold=1 \
        -XX:CMSInitiatingOccupancyFraction=80 \
        -XX:+HeapDumpOnOutOfMemoryError \
        -Dcom.sun.management.jmxremote.port=9004 \
        -Dcom.sun.management.jmxremote.ssl=false \
        -Dcom.sun.management.jmxremote.authenticate=false"
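
One thing I'll probably add before the next slowdown is GC logging, so I
can see exactly what the collector is doing while a node degrades. These
are standard HotSpot flags; the log path is just an example of where I'd
put it:

        -verbose:gc \
        -XX:+PrintGCDetails \
        -XX:+PrintGCTimeStamps \
        -Xloggc:/var/log/cassandra/gc.log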

  <HintedHandoffEnabled>false</HintedHandoffEnabled>
...
      <ColumnFamily CompareWith="BytesType" Name="KvAds"
                    KeysCached="0"
                    RowsCached="10000000"/>

 <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackAwareStrategy</ReplicaPlacementStrategy>
      <ReplicationFactor>2</ReplicationFactor>

 <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
    </Keyspace>
...
  <Partitioner>org.apache.cassandra.dht.RandomPartitioner</Partitioner>
...
  <DiskAccessMode>standard</DiskAccessMode>
  <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
  <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>

  <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>

  <MemtableThroughputInMB>64</MemtableThroughputInMB>
  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
  <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>

  <ConcurrentReads>8</ConcurrentReads>
  <ConcurrentWrites>32</ConcurrentWrites>

  <CommitLogSync>periodic</CommitLogSync>
  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>


Thanks!
-- 
/Ran
