Yes, look at cassandra.yaml; there is a section about throttling compaction. You still *want* multi-threaded compaction. Throttling is applied across all compaction threads. The reason is that you don't want to get stuck compacting bigger files while the smaller ones build up waiting for the bigger compaction to finish. That will slowly degrade read performance.
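For reference, the throttle and threading knobs live in cassandra.yaml; in 0.8 they look roughly like this (values shown are the defaults as I recall them, so double-check against your own yaml):

```yaml
# Throttle total compaction I/O across all compaction threads,
# in MB/s (0 disables throttling). 16 is the shipped default.
compaction_throughput_mb_per_sec: 16

# Keep multi-threaded compaction enabled so small SSTables don't
# queue up behind one long-running compaction of large files.
multithreaded_compaction: true

# Optional cap on simultaneous compactions; defaults to the
# number of cores if left commented out.
# concurrent_compactors: 4
```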

On Mon, Oct 3, 2011 at 1:19 PM, Ramesh Natarajan <ramesh25@gmail.com> wrote:
Thanks for the pointers.  I checked the system, and iostat showed that we are saturating the disk at 100%. The disk is a SCSI device exposed by ESXi, running on a dedicated LUN as RAID10 (4 x 600GB 15k drives) connected to the ESX host via iSCSI.

When I run compactionstats I see we are compacting a column family that has about 10GB of data. During this time I also see dropped messages in the system.log file.

Since my I/O rates are constant in my tests, I think compaction is throwing things off. Is there a way I can throttle compaction in cassandra? Rather than running multiple compactions at the same time, I would like to throttle them by I/O rate. Is that possible?

If instead of having 5 big column families I create say 1000 each (5000 total), do you think it will help in this case? (Smaller files, and so a smaller load per compaction.)

Is it normal to have 5000 column families?

thanks
Ramesh



On Mon, Oct 3, 2011 at 2:50 PM, Chris Goffinet <cg@chrisgoffinet.com> wrote:
Most likely what is happening is that you are running single-threaded compaction. Look at cassandra.yaml for how to enable multi-threaded compaction. As more data comes into the system, bigger files get created during compaction. You could be in a situation where you are compacting at a higher bucket level N while compactions build up at the lower buckets.

Run "nodetool -host localhost compactionstats" to get an idea of what's going on.


On Mon, Oct 3, 2011 at 12:05 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
To understand what's going on, you might want to first do just the
write test, look at the results, then do just the read test, and
then do combined read/write tests.

Since you mentioned high update/delete rates, I should also ask: what CL
do you use for writes/reads? With high updates/deletes plus a high CL, I
think one should expect reads to slow down when SSTables have not been compacted.

You have 20G of memory, 17G of which is used by your process, and I
also see 36G VIRT, which I don't really understand given that swap is
disabled. Look at the sar -r output too to make sure no swapping is
occurring. Also, verify that jna.jar is installed.
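A quick way to check both on a node might be something like this (the install path is an assumption for a 0.8.6 tarball layout; adjust to yours):

```shell
# Confirm swap really is disabled: SwapTotal should read 0 kB.
grep SwapTotal /proc/meminfo

# Look for jna.jar in Cassandra's lib directory (path is an
# assumption; adjust to your layout).
ls apache-cassandra-0.8.6/lib/jna*.jar 2>/dev/null || echo "jna.jar not found"
```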

On Mon, Oct 3, 2011 at 11:52 AM, Ramesh Natarajan <ramesh25@gmail.com> wrote:
> I will start another test run to collect these stats. Our test model is in
> the neighborhood of 4500 inserts, 8000 updates & deletes, and 1500 reads
> every second across 6 servers.
> Can you elaborate more on reducing the heap space? Do you think the 17G
> RSS is a problem?
> thanks
> Ramesh
>
>
> On Mon, Oct 3, 2011 at 1:33 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
>>
>> I am wondering if you are seeing issues because of more frequent
>> compactions kicking in. Is this primarily write ops, or reads too?
>> During the test, gather data like:
>>
>> 1. cfstats
>> 2. tpstats
>> 3. compactionstats
>> 4. netstats
>> 5. iostat
>>
>> You have RSS memory close to 17GB. Maybe someone can give further
>> advice on whether that could be because of mmap. You might want to
>> lower your heap size to 6-8G and see if that helps.
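If you try that, the heap is set in conf/cassandra-env.sh; a minimal sketch, assuming the 0.8.x env script, with illustrative (untuned) sizes:

```shell
# Fragment for conf/cassandra-env.sh: set the heap explicitly instead
# of letting the script auto-size it from system memory. 8G max heap
# and an 800M young generation are starting points to experiment with,
# not tuned recommendations.
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"
```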
>>
>> Also, check that you have jna.jar deployed and that you see the malloc
>> successful message in the logs.
>>
>> On Mon, Oct 3, 2011 at 10:36 AM, Ramesh Natarajan <ramesh25@gmail.com> wrote:
>> > We have 5 CFs.  Attached is the output from the describe command.  We
>> > don't have row cache enabled.
>> > Thanks
>> > Ramesh
>> > Keyspace: MSA:
>> >   Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
>> >   Durable Writes: true
>> >     Options: [replication_factor:3]
>> >   Column Families:
>> >     ColumnFamily: admin
>> >       Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
>> >       Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
>> >       Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>> >       Row cache size / save period in seconds: 0.0/0
>> >       Key cache size / save period in seconds: 200000.0/14400
>> >       Memtable thresholds: 0.5671875/1440/121 (millions of ops/minutes/MB)
>> >       GC grace seconds: 3600
>> >       Compaction min/max thresholds: 4/32
>> >       Read repair chance: 1.0
>> >       Replicate on write: true
>> >       Built indexes: []
>> >     ColumnFamily: modseq
>> >       Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
>> >       Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
>> >       Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>> >       Row cache size / save period in seconds: 0.0/0
>> >       Key cache size / save period in seconds: 500000.0/14400
>> >       Memtable thresholds: 0.5671875/1440/121 (millions of ops/minutes/MB)
>> >       GC grace seconds: 3600
>> >       Compaction min/max thresholds: 4/32
>> >       Read repair chance: 1.0
>> >       Replicate on write: true
>> >       Built indexes: []
>> >     ColumnFamily: msgid
>> >       Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
>> >       Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
>> >       Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>> >       Row cache size / save period in seconds: 0.0/0
>> >       Key cache size / save period in seconds: 500000.0/14400
>> >       Memtable thresholds: 0.5671875/1440/121 (millions of ops/minutes/MB)
>> >       GC grace seconds: 864000
>> >       Compaction min/max thresholds: 4/32
>> >       Read repair chance: 1.0
>> >       Replicate on write: true
>> >       Built indexes: []
>> >     ColumnFamily: participants
>> >       Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
>> >       Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
>> >       Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>> >       Row cache size / save period in seconds: 0.0/0
>> >       Key cache size / save period in seconds: 500000.0/14400
>> >       Memtable thresholds: 0.5671875/1440/121 (millions of ops/minutes/MB)
>> >       GC grace seconds: 3600
>> >       Compaction min/max thresholds: 4/32
>> >       Read repair chance: 1.0
>> >       Replicate on write: true
>> >       Built indexes: []
>> >     ColumnFamily: uid
>> >       Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
>> >       Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
>> >       Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
>> >       Row cache size / save period in seconds: 0.0/0
>> >       Key cache size / save period in seconds: 2000000.0/14400
>> >       Memtable thresholds: 0.4/1440/121 (millions of ops/minutes/MB)
>> >       GC grace seconds: 3600
>> >       Compaction min/max thresholds: 4/32
>> >       Read repair chance: 1.0
>> >       Replicate on write: true
>> >       Built indexes: []
>> >
>> >
>> >
>> >
>> > On Mon, Oct 3, 2011 at 12:26 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:
>> >>
>> >> On Mon, Oct 3, 2011 at 10:12 AM, Ramesh Natarajan <ramesh25@gmail.com> wrote:
>> >> > I am running a cassandra cluster of 6 nodes running RHEL6 virtualized by
>> >> > ESXi 5.0. Each VM is configured with 20GB of RAM and 12 cores. Our test
>> >> > setup performs about 3000 inserts per second. The cassandra data partition
>> >> > is on an XFS filesystem mounted with options
>> >> > (noatime,nodiratime,nobarrier,logbufs=8). We have no swap enabled on the
>> >> > VMs and vm.swappiness is set to 0. To avoid contention issues, our
>> >> > cassandra VMs are not running any application other than cassandra.
>> >> > The test runs fine for about 12 hours or so. After that the performance
>> >> > starts to degrade to about 1500 inserts per sec. By 18-20 hours the
>> >> > inserts go down to 300 per sec.
>> >> > If I do a truncate, it starts clean and runs for a few hours (though not
>> >> > as cleanly as after rebooting).
>> >> > We find a direct correlation between kswapd kicking in after 12 hours or
>> >> > so and the performance degradation. If I look at the cached memory, it is
>> >> > close to 10G. I am not getting an OOM error in cassandra, so it looks
>> >> > like we are not running out of memory. Can someone explain how we can
>> >> > optimize this so that kswapd doesn't kick in?
>> >> >
>> >> > Our top output shows
>> >> > top - 16:23:54 up 2 days, 23:17,  4 users,  load average: 2.21, 2.08, 2.02
>> >> > Tasks: 213 total,   1 running, 212 sleeping,   0 stopped,   0 zombie
>> >> > Cpu(s):  1.6%us,  0.8%sy,  0.0%ni, 90.9%id,  6.3%wa,  0.0%hi,  0.2%si,  0.0%st
>> >> > Mem:  20602812k total, 20320424k used,   282388k free,     1020k buffers
>> >> > Swap:        0k total,        0k used,        0k free, 10145516k cached
>> >> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> >> >  2586 root      20   0 36.3g  17g 8.4g S 32.1 88.9   8496:37 java
>> >> >
>> >> > java output
>> >> > root      2453     1 99 Sep30 pts/0    9-13:51:38 java -ea
>> >> > -javaagent:./apache-cassandra-0.8.6/bin/../lib/jamm-0.2.2.jar
>> >> > -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms10059M
>> >> > -Xmx10059M
>> >> > -Xmn1200M -XX:+HeapDumpOnOutOfMemoryError -Xss128k -XX:+UseParNewGC
>> >> > -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
>> >> > -XX:SurvivorRatio=8
>> >> > -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
>> >> > -XX:+UseCMSInitiatingOccupancyOnly -Djava.net.preferIPv4Stack=true
>> >> > -Dcom.sun.management.jmxremote.port=7199
>> >> > -Dcom.sun.management.jmxremote.ssl=false
>> >> > -Dcom.sun.management.jmxremote.authenticate=false
>> >> > -Djava.rmi.server.hostname=10.19.104.14
>> >> > -Djava.net.preferIPv4Stack=true
>> >> > -Dlog4j.configuration=log4j-server.properties
>> >> > -Dlog4j.defaultInitOverride=true -cp
>> >> >
>> >> >
>> >> > ./apache-cassandra-0.8.6/bin/../conf:./apache-cassandra-0.8.6/bin/../build/classes/main:./apache-cassandra-0.8.6/bin/../build/classes/thrift:./apache-cassandra-0.8.6/bin/../lib/antlr-3.2.jar:./apache-cassandra-0.8.6/bin/../lib/apache-cassandra-0.8.6.jar:./apache-cassandra-0.8.6/bin/../lib/apache-cassandra-thrift-0.8.6.jar:./apache-cassandra-0.8.6/bin/../lib/avro-1.4.0-fixes.jar:./apache-cassandra-0.8.6/bin/../lib/avro-1.4.0-sources-fixes.jar:./apache-cassandra-0.8.6/bin/../lib/commons-cli-1.1.jar:./apache-cassandra-0.8.6/bin/../lib/commons-codec-1.2.jar:./apache-cassandra-0.8.6/bin/../lib/commons-collections-3.2.1.jar:./apache-cassandra-0.8.6/bin/../lib/commons-lang-2.4.jar:./apache-cassandra-0.8.6/bin/../lib/concurrentlinkedhashmap-lru-1.1.jar:./apache-cassandra-0.8.6/bin/../lib/guava-r08.jar:./apache-cassandra-0.8.6/bin/../lib/high-scale-lib-1.1.2.jar:./apache-cassandra-0.8.6/bin/../lib/jackson-core-asl-1.4.0.jar:./apache-cassandra-0.8.6/bin/../lib/jackson-mapper-asl-1.4.0.jar:./apache-cassandra-0.8.6/bin/../lib/jamm-0.2.2.jar:./apache-cassandra-0.8.6/bin/../lib/jline-0.9.94.jar:./apache-cassandra-0.8.6/bin/../lib/json-simple-1.1.jar:./apache-cassandra-0.8.6/bin/../lib/libthrift-0.6.jar:./apache-cassandra-0.8.6/bin/../lib/log4j-1.2.16.jar:./apache-cassandra-0.8.6/bin/../lib/mx4j-examples.jar:./apache-cassandra-0.8.6/bin/../lib/mx4j-impl.jar:./apache-cassandra-0.8.6/bin/../lib/mx4j.jar:./apache-cassandra-0.8.6/bin/../lib/mx4j-jmx.jar:./apache-cassandra-0.8.6/bin/../lib/mx4j-remote.jar:./apache-cassandra-0.8.6/bin/../lib/mx4j-rimpl.jar:./apache-cassandra-0.8.6/bin/../lib/mx4j-rjmx.jar:./apache-cassandra-0.8.6/bin/../lib/mx4j-tools.jar:./apache-cassandra-0.8.6/bin/../lib/servlet-api-2.5-20081211.jar:./apache-cassandra-0.8.6/bin/../lib/slf4j-api-1.6.1.jar:./apache-cassandra-0.8.6/bin/../lib/slf4j-log4j12-1.6.1.jar:./apache-cassandra-0.8.6/bin/../lib/snakeyaml-1.6.jar
>> >> > org.apache.cassandra.thrift.CassandraDaemon
>> >> >
>> >> >
>> >> > Ring output
>> >> > [root@CAP4-CNode4 apache-cassandra-0.8.6]# ./bin/nodetool -h 127.0.0.1 ring
>> >> > Address         DC          Rack        Status State   Load      Owns    Token
>> >> >                                                                          141784319550391026443072753096570088105
>> >> > 10.19.104.11    datacenter1 rack1       Up     Normal  19.92 GB  16.67%  0
>> >> > 10.19.104.12    datacenter1 rack1       Up     Normal  19.3 GB   16.67%  28356863910078205288614550619314017621
>> >> > 10.19.104.13    datacenter1 rack1       Up     Normal  18.57 GB  16.67%  56713727820156410577229101238628035242
>> >> > 10.19.104.14    datacenter1 rack1       Up     Normal  19.34 GB  16.67%  85070591730234615865843651857942052863
>> >> > 10.19.105.11    datacenter1 rack1       Up     Normal  19.88 GB  16.67%  113427455640312821154458202477256070484
>> >> > 10.19.105.12    datacenter1 rack1       Up     Normal  20 GB     16.67%  141784319550391026443072753096570088105
>> >> > [root@CAP4-CNode4 apache-cassandra-0.8.6]#
>> >>
>> How many CFs? Can you describe the CFs and post the configuration? Do
>> you have row cache enabled?
>> >
>> >
>
>