Hi Mark,

Thanks for sharing your experience. We are quite surprised by these kinds of issues, especially since our data volume is tiny (50 concurrent requests, 10GB of data on 5 nodes). Unfortunately we don't have enough time or resources to keep it stable in production, so we are pulling the plug today. Back to MySQL.

Still, we will definitely check back later.

alex

Mark Jones wrote:

I’ve had a compaction take more than 500 seconds, and I see the same stumbling you do. I’ve adapted my clients to know about all the nodes in the ring, and when one of the nodes starts doing this, they try the next node in their list. The problem is that I need to randomize which node each client chooses (while still trying all the nodes): the stumbling happens often enough that, before I get to 70 million inserts, all of the clients end up sending their inserts to the same node. Looking at the logs, they start off randomly connected (spreading the load), and shortly after that they are all walking from node to node as a group. (I have a 4-node cluster at the moment.)
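Roughly what I mean, as a stripped-down sketch (illustrative only, not my actual client code; the addresses are placeholders):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Each client keeps its own randomly shuffled copy of the ring, so that when a
// node starts stumbling the clients don't all fail over to the same next node
// in the same order.
public class NodeRotation {
    private final List<String> nodes;
    private int current = 0;

    public NodeRotation(List<String> ringNodes) {
        this.nodes = new ArrayList<String>(ringNodes);
        Collections.shuffle(this.nodes);   // per-client random order
    }

    /** The node this client should be talking to right now. */
    public String currentNode() {
        return nodes.get(current);
    }

    /** Called when the current node starts stumbling: advance to the next node,
     *  wrapping around so every node in the ring still gets tried. */
    public String failOver() {
        current = (current + 1) % nodes.size();
        return nodes.get(current);
    }

    public static void main(String[] args) {
        // Placeholder addresses for a 4-node ring.
        NodeRotation ring = new NodeRotation(Arrays.asList(
                "10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"));
        System.out.println("start on " + ring.currentNode());
        System.out.println("fail over to " + ring.failOver());
    }
}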

 

From: Alex Li [mailto:alexli@gmail.com]
Sent: Thursday, April 22, 2010 9:36 PM
To: user@cassandra.apache.org
Subject: Re: Periodically hiccups

 

Thank you for the tip about microseconds; the stats make sense now (e.g. 152676.92 µs is about 153 ms, not 152 seconds). Thank you!

Two more things happened:

    - One compaction took 19s; is that normal?
    - The whole ring was having trouble for a while; any ideas?

Thanks.

 INFO [FLUSH-WRITER-POOL:1] 2010-04-22 19:13:13,247 Memtable.java (line 162) Completed flushing /var/lib/cassandra/data/Pandora/Folder-1893-Data.db
 INFO [COMPACTION-POOL:1] 2010-04-22 19:13:13,248 CompactionManager.java (line 246) Compacting [org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Pandora/Folder-1890-Data.db'),org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Pandora/Folder-1891-Data.db'),org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Pandora/Folder-1892-Data.db'),org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Pandora/Folder-1893-Data.db')]
 INFO [COMPACTION-POOL:1] 2010-04-22 19:13:32,425 CompactionManager.java (line 326) Compacted to /var/lib/cassandra/data/Pandora/Folder-1894-Data.db.  71368197/53308707 bytes for 45168 keys.  Time: 18994ms.


INFO [Timer-1] 2010-04-22 19:30:50,239 Gossiper.java (line 179) InetAddress /10.10.104.16 is now dead.
 INFO [Timer-1] 2010-04-22 19:30:50,239 Gossiper.java (line 179) InetAddress /10.10.104.15 is now dead.
 INFO [Timer-1] 2010-04-22 19:30:50,240 Gossiper.java (line 179) InetAddress /10.10.104.13 is now dead.
 INFO [Timer-1] 2010-04-22 19:30:50,240 Gossiper.java (line 179) InetAddress /10.10.104.11 is now dead.
 INFO [GMFD:1] 2010-04-22 19:30:51,622 Gossiper.java (line 568) InetAddress /10.10.104.15 is now UP
 INFO [GMFD:1] 2010-04-22 19:30:51,622 Gossiper.java (line 568) InetAddress /10.10.104.11 is now UP
 INFO [GMFD:1] 2010-04-22 19:30:51,665 Gossiper.java (line 568) InetAddress /10.10.104.13 is now UP
 INFO [GMFD:1] 2010-04-22 19:30:55,377 Gossiper.java (line 568) InetAddress /10.10.104.16 is now UP
 INFO [Timer-1] 2010-04-22 19:31:15,310 Gossiper.java (line 179) InetAddress /10.10.104.16 is now dead.
 INFO [GMFD:1] 2010-04-22 19:31:15,866 Gossiper.java (line 568) InetAddress /10.10.104.16 is now UP


Alex Li wrote:

Hello,

We recently deployed a cluster of 5 Cassandra nodes into production and ran into big problems with periodic hiccups (individual nodes going down, high CPU, client connection timeouts). It was terrible with 0.5 (one hiccup every 5-10 minutes); today we upgraded to 0.6.1 and it happens less frequently now (roughly once every 30 minutes or so), but it is still quite frustrating.

We use ReplicationFactor=3 for all column families. The 5 nodes sit behind haproxy, and the Java clients go through haproxy. The most obvious symptom is this: as soon as one node goes down, the number of connections between haproxy and the Cassandra nodes shoots up to 1000 (normally it is stable at around 40, which should be trivial for Cassandra), and the connections don't come down for quite a while. Meanwhile the Java clients get all kinds of TimeoutExceptions and keep retrying. Eventually we have to restart haproxy, and then things go back to normal.
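(Below is a rough sketch, illustrative only and not our actual client code, of the kind of bounded retry with backoff the client side would need so that a timing-out node does not turn into an ever-growing pile of connections on haproxy.)

import java.util.concurrent.Callable;

// Illustrative only: cap the number of retries and back off between attempts,
// so a timing-out node does not become a flood of new connections through haproxy.
public class BoundedRetry {

    public static <T> T withRetry(Callable<T> op, int maxAttempts, long initialBackoffMs)
            throws Exception {
        long backoff = initialBackoffMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();            // e.g. one read or write request
            } catch (Exception e) {          // in practice: the TimeoutExceptions we see
                last = e;
                if (attempt == maxAttempts) {
                    break;                   // out of attempts
                }
                Thread.sleep(backoff);       // wait before trying again
                backoff *= 2;                // exponential backoff
            }
        }
        throw last;                          // give up instead of retrying forever
    }

    public static void main(String[] args) throws Exception {
        try {
            // Toy usage: an operation that always times out, so the bounded
            // failure path is visible.
            withRetry(new Callable<String>() {
                public String call() throws Exception {
                    throw new Exception("TimeoutException (simulated)");
                }
            }, 3, 100);
        } catch (Exception e) {
            System.out.println("gave up after retries: " + e.getMessage());
        }
    }
}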

Each node has a 5GB max heap, a reasonably powerful CPU (quad-core), and a software RAID mirror. We are definitely NOT putting a lot of load on it yet, mostly 20-50 concurrent requests to Cassandra, but it is not holding up! Please help; we are on the verge of giving up on Cassandra after 5 days of periodic "outages".

A couple of observations:

- cfstats shows significant read latency on the "system" keyspace, over 5s (see below)

- RecentReadLatencyMicros and RecentWriteLatencyMicros are very high for StorageProxy, as well as for every column family in JMX: up to 152676.92 and 6950 (they are in ms, right?). In cfstats, however, they are quite small.

- Every second we see 5-10 DigestMismatchException entries in the log:

INFO [pool-1-thread-15857] 2010-04-22 00:37:37,887 StorageProxy.java (line 499) DigestMismatchException: Mismatch for key 1068022523 (d41d8cd98f00b204e9800998ecf8427e vs 0dd4cdaeeb1a334ae133c6955e109629)

Please advise. Thank you!

storage-conf, cfstats, tpstats are listed below:

root@cdb-006:/glass/sfw/cassandra# cat conf/storage-conf.xml

<Storage>

   <AutoBootstrap>false</AutoBootstrap>
   <Keyspaces>
       <Keyspace Name="Pandora">
  
           <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
           <ReplicationFactor>3</ReplicationFactor>
           <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
       </Keyspace>
       </Keyspaces>


   <RpcTimeoutInMillis>10000</RpcTimeoutInMillis>
   <CommitLogRotationThresholdInMB>128</CommitLogRotationThresholdInMB>

   <DiskAccessMode>auto</DiskAccessMode>
   <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
   <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
   <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
   <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
   <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
   <MemtableThroughputInMB>64</MemtableThroughputInMB>
   <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
   <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
      <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>

   <!--
              ~ Unlike most systems, in Cassandra writes are faster than reads, so
              ~ you can afford more of those in parallel.  A good rule of thumb is 2
              ~ concurrent reads per processor core.  Increase ConcurrentWrites to
              ~ the number of clients writing at once if you enable CommitLogSync +
              ~ CommitLogSyncDelay. -->
   <ConcurrentReads>8</ConcurrentReads>
   <ConcurrentWrites>32</ConcurrentWrites>

   <!--
              ~ CommitLogSync may be either "periodic" or "batch."  When in batch
              ~ mode, Cassandra won't ack writes until the commit log has been
              ~ fsynced to disk.  It will wait up to CommitLogSyncBatchWindowInMS
              ~ milliseconds for other writes, before performing the sync.

              ~ This is less necessary in Cassandra than in traditional databases
              ~ since replication reduces the odds of losing data from a failure
              ~ after writing the log entry but before it actually reaches the disk.
              ~ So the other option is "periodic," where writes may be acked immediately
              ~ and the CommitLog is simply synced every CommitLogSyncPeriodInMS
              ~ milliseconds.
             -->
   <CommitLogSync>periodic</CommitLogSync>
   <!--
              ~ Interval at which to perform syncs of the CommitLog in periodic mode.
              ~ Usually the default of 10000ms is fine; increase it if your i/o
              ~ load is such that syncs are taking excessively long times.
             -->
   <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
   <!--
              ~ Delay (in milliseconds) during which additional commit log entries
              ~ may be written before fsync in batch mode.  This will increase
              ~ latency slightly, but can vastly improve throughput where there are
              ~ many writers.  Set to zero to disable (each entry will be synced
              ~ individually).  Reasonable values range from a minimal 0.1 to 10 or
              ~ even more if throughput matters more than latency.
             -->
   <!-- <CommitLogSyncBatchWindowInMS>1</CommitLogSyncBatchWindowInMS> -->

   <!--
               ~ Time to wait before garbage-collecting deletion markers.  Set this to
              ~ a large enough value that you are confident that the deletion marker
              ~ will be propagated to all replicas by the time this many seconds has
              ~ elapsed, even in the face of hardware failures.  The default value is
              ~ ten days.
             -->
   <GCGraceSeconds>864000</GCGraceSeconds>
</Storage>
root@cdb-006:/glass/sfw/cassandra# bin/nodetool -h localhost cfstats
Keyspace: system
   Read Count: 878
   Read Latency: 5752.042634396355 ms.
   Write Count: 2260398
   Write Latency: 0.014567047926957996 ms.
   Pending Tasks: 0
       Column Family: LocationInfo
       SSTable count: 2
       Space used (live): 3569
       Space used (total): 3569
       Memtable Columns Count: 0
       Memtable Data Size: 0
       Memtable Switch Count: 1
       Read Count: 1
       Read Latency: NaN ms.
       Write Count: 6
       Write Latency: NaN ms.
       Pending Tasks: 0
       Key cache capacity: 2
       Key cache size: 1
       Key cache hit rate: NaN
       Row cache: disabled
       Compacted row minimum size: 0
       Compacted row maximum size: 0
       Compacted row mean size: 0

       Column Family: HintsColumnFamily
       SSTable count: 2
       Space used (live): 70272035
       Space used (total): 70272035
       Memtable Columns Count: 56264
       Memtable Data Size: 486854
       Memtable Switch Count: 21
       Read Count: 877
       Read Latency: 13614.412 ms.
       Write Count: 2260392
       Write Latency: 0.142 ms.
       Pending Tasks: 0
       Key cache capacity: 2
       Key cache size: 2
       Key cache hit rate: 0.25
       Row cache: disabled
       Compacted row minimum size: 78567
       Compacted row maximum size: 39561901
       Compacted row mean size: 27878603

----------------
Keyspace: Titan
   Read Count: 8948702
   Read Latency: 7.949136100185256 ms.
   Write Count: 3393490
   Write Latency: 0.19255415398306758 ms.
   Pending Tasks: 0
       Column Family: FbUser
       SSTable count: 6
       Space used (live): 3675014807
       Space used (total): 3675014807
       Memtable Columns Count: 250055
       Memtable Data Size: 9146339
       Memtable Switch Count: 9
       Read Count: 6591406
       Read Latency: 8.361 ms.
       Write Count: 343030
       Write Latency: 0.078 ms.
       Pending Tasks: 0
       Key cache capacity: 200000
       Key cache size: 200000
       Key cache hit rate: 0.6341912478864628
       Row cache: disabled
       Compacted row minimum size: 320
       Compacted row maximum size: 586
       Compacted row mean size: 479

       Column Family: Payment
       SSTable count: 3
       Space used (live): 2473
       Space used (total): 2473
       Memtable Columns Count: 0
       Memtable Data Size: 0
       Memtable Switch Count: 0
       Read Count: 27728
       Read Latency: 0.059 ms.
       Write Count: 0
       Write Latency: NaN ms.
       Pending Tasks: 0
       Key cache capacity: 200000
       Key cache size: 0
       Key cache hit rate: NaN
       Row cache: disabled
       Compacted row minimum size: 0
       Compacted row maximum size: 0
       Compacted row mean size: 0

       Column Family: Club
       SSTable count: 6
       Space used (live): 363609256
       Space used (total): 477968226
       Memtable Columns Count: 124134
       Memtable Data Size: 38144903
       Memtable Switch Count: 140
       Read Count: 63668
       Read Latency: 29.187 ms.
       Write Count: 1966098
       Write Latency: 0.184 ms.
       Pending Tasks: 0
       Key cache capacity: 200000
       Key cache size: 63751
       Key cache hit rate: 0.6543073524771221
       Row cache: disabled
       Compacted row minimum size: 274
       Compacted row maximum size: 49784
       Compacted row mean size: 7711

       Column Family: RandomUsers
       SSTable count: 1
       Space used (live): 1329464
       Space used (total): 1329464
       Memtable Columns Count: 170433
       Memtable Data Size: 5525529
       Memtable Switch Count: 8
       Read Count: 58236
       Read Latency: 6.215 ms.
       Write Count: 143339
       Write Latency: 0.575 ms.
       Pending Tasks: 0
       Key cache capacity: 200000
       Key cache size: 21
       Key cache hit rate: 1.0
       Row cache: disabled
       Compacted row minimum size: 1244196
       Compacted row maximum size: 1327409
       Compacted row mean size: 1287929

       Column Family: User
       SSTable count: 4
       Space used (live): 583812728
       Space used (total): 583812728
       Memtable Columns Count: 248833
       Memtable Data Size: 7909687
       Memtable Switch Count: 12
       Read Count: 2207667
       Read Latency: 8.413 ms.
       Write Count: 941029
       Write Latency: 0.107 ms.
       Pending Tasks: 0
       Key cache capacity: 200000
       Key cache size: 94940
       Key cache hit rate: 0.9769078238427732
       Row cache: disabled
       Compacted row minimum size: 259
       Compacted row maximum size: 174340
       Compacted row mean size: 3916

----------------
Keyspace: Pandora
   Read Count: 1475530
   Read Latency: 5.249669284257182 ms.
   Write Count: 856550
   Write Latency: 0.16070848053236822 ms.
   Pending Tasks: 0
       Column Family: Folder
       SSTable count: 7
       Space used (live): 942926793
       Space used (total): 942926793
       Memtable Columns Count: 150119
       Memtable Data Size: 814194
       Memtable Switch Count: 21
       Read Count: 604605
       Read Latency: 7.539 ms.
       Write Count: 553015
       Write Latency: 0.173 ms.
       Pending Tasks: 0
       Key cache capacity: 200000
       Key cache size: 111884
       Key cache hit rate: 0.9770064886992118
       Row cache: disabled
       Compacted row minimum size: 234
       Compacted row maximum size: 1644541
       Compacted row mean size: 1238

       Column Family: Attachment
       SSTable count: 5
       Space used (live): 754692822
       Space used (total): 754692822
       Memtable Columns Count: 548
       Memtable Data Size: 7110
       Memtable Switch Count: 8
       Read Count: 22708
       Read Latency: 6.950 ms.
       Write Count: 29835
       Write Latency: 0.025 ms.
       Pending Tasks: 0
       Key cache capacity: 200000
       Key cache size: 8299
       Key cache hit rate: 0.017857142857142856
       Row cache: disabled
       Compacted row minimum size: 269
       Compacted row maximum size: 274
       Compacted row mean size: 272

       Column Family: Message
       SSTable count: 7
       Space used (live): 3689251440
       Space used (total): 3689251440
       Memtable Columns Count: 50399
       Memtable Data Size: 1361773
       Memtable Switch Count: 10
       Read Count: 163630
       Read Latency: 16.981 ms.
       Write Count: 257843
       Write Latency: 0.113 ms.
       Pending Tasks: 0
       Key cache capacity: 200000
       Key cache size: 52427
       Key cache hit rate: 0.5041365046535677
       Row cache: disabled
       Compacted row minimum size: 272
       Compacted row maximum size: 2754
       Compacted row mean size: 752

       Column Family: FolderInfo
       SSTable count: 0
       Space used (live): 0
       Space used (total): 0
       Memtable Columns Count: 0
       Memtable Data Size: 0
       Memtable Switch Count: 0
       Read Count: 8704
       Read Latency: 0.008 ms.
       Write Count: 0
       Write Latency: NaN ms.
       Pending Tasks: 0
       Key cache capacity: 200000
       Key cache size: 0
       Key cache hit rate: NaN
       Row cache: disabled
       Compacted row minimum size: 0
       Compacted row maximum size: 0
       Compacted row mean size: 0

       Column Family: StandardByUUID1
       SSTable count: 0
       Space used (live): 0
       Space used (total): 0
       Memtable Columns Count: 0
       Memtable Data Size: 0
       Memtable Switch Count: 0
       Read Count: 8704
       Read Latency: 0.005 ms.
       Write Count: 0
       Write Latency: NaN ms.
       Pending Tasks: 0
       Key cache capacity: 200000
       Key cache size: 0
       Key cache hit rate: NaN
       Row cache: disabled
       Compacted row minimum size: 0
       Compacted row maximum size: 0
       Compacted row mean size: 0

       Column Family: User
       SSTable count: 4
       Space used (live): 120121309
       Space used (total): 120121309
       Memtable Columns Count: 1786
       Memtable Data Size: 50657
       Memtable Switch Count: 8
       Read Count: 641067
       Read Latency: 2.728 ms.
       Write Count: 15857
       Write Latency: 0.148 ms.
       Pending Tasks: 0
       Key cache capacity: 200000
       Key cache size: 45965
       Key cache hit rate: 0.957460499034818
       Row cache: disabled
       Compacted row minimum size: 271
       Compacted row maximum size: 506
       Compacted row mean size: 499

       Column Family: Standard1
       SSTable count: 0
       Space used (live): 0
       Space used (total): 0
       Memtable Columns Count: 0
       Memtable Data Size: 0
       Memtable Switch Count: 0
       Read Count: 8704
       Read Latency: 0.005 ms.
       Write Count: 0
       Write Latency: NaN ms.
       Pending Tasks: 0
       Key cache capacity: 200000
       Key cache size: 0
       Key cache hit rate: NaN
       Row cache: disabled
       Compacted row minimum size: 0
       Compacted row maximum size: 0
       Compacted row mean size: 0

       Column Family: Standard2
       SSTable count: 0
       Space used (live): 0
       Space used (total): 0
       Memtable Columns Count: 0
       Memtable Data Size: 0
       Memtable Switch Count: 0
       Read Count: 8704
       Read Latency: 0.006 ms.
       Write Count: 0
       Write Latency: NaN ms.
       Pending Tasks: 0
       Key cache capacity: 200000
       Key cache size: 0
       Key cache hit rate: NaN
       Row cache: disabled
       Compacted row minimum size: 0
       Compacted row maximum size: 0
       Compacted row mean size: 0

       Column Family: Folder1
       SSTable count: 0
       Space used (live): 0
       Space used (total): 0
       Memtable Columns Count: 0
       Memtable Data Size: 0
       Memtable Switch Count: 0
       Read Count: 8704
       Read Latency: 0.005 ms.
       Write Count: 0
       Write Latency: NaN ms.
       Pending Tasks: 0
       Key cache capacity: 200000
       Key cache size: 0
       Key cache hit rate: NaN
       Row cache: disabled
       Compacted row minimum size: 0
       Compacted row maximum size: 0
       Compacted row mean size: 0

----------------
root@cdb-006:/glass/sfw/cassandra# bin/nodetool -h localhost tpstats
Pool Name                    Active   Pending      Completed
FILEUTILS-DELETE-POOL             0         0             70
STREAM-STAGE                      0         0              0
RESPONSE-STAGE                    0         0       18298968
ROW-READ-STAGE                    1         0       10168985
LB-OPERATIONS                     0         0              0
MESSAGE-DESERIALIZER-POOL         0         0       24960681
GMFD                              0         0          93647
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0              0
ROW-MUTATION-STAGE                0         0        4280703
MESSAGE-STREAMING-POOL            0         0             12
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0              0
MEMTABLE-POST-FLUSHER             0         0            224
FLUSH-WRITER-POOL                 0         0            224
AE-SERVICE-STAGE                  0         0             22
HINTED-HANDOFF-POOL               1         5             73

 

Here is some of the system load data I collected.

top - 00:43:43 up 1 day, 20:15,  2 users,  load average: 2.96, 2.74, 2.37
Tasks: 160 total,   1 running, 159 sleeping,   0 stopped,   0 zombie
Cpu0  : 10.9%us,  2.6%sy,  0.0%ni, 83.2%id,  3.2%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  9.6%us,  2.2%sy,  0.0%ni, 85.4%id,  2.8%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 37.1%us,  4.2%sy,  0.0%ni, 23.3%id, 33.8%wa,  1.2%hi,  0.3%si,  0.0%st
Cpu3  : 18.5%us,  3.9%sy,  0.0%ni, 62.6%id,  9.5%wa,  4.1%hi,  1.4%si,  0.0%st
Cpu4  : 11.1%us,  2.6%sy,  0.0%ni, 80.1%id,  3.2%wa,  2.9%hi,  0.2%si,  0.0%st
Cpu5  : 10.1%us,  2.7%sy,  0.0%ni, 84.4%id,  2.7%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 20.1%us,  2.6%sy,  0.0%ni, 62.6%id, 14.7%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 18.9%us,  4.4%sy,  0.0%ni, 69.0%id,  7.7%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   8188108k total,  8135168k used,    52940k free,    12968k buffers
Swap: 62500792k total,   277484k used, 62223308k free,  2381704k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                  
18000 root      20   0 15.6g 6.7g 1.5g S   64 85.4 552:54.13 java                                                                     
    1 root      20   0 19316  672  412 S    0  0.0   0:03.36 init       

root@cdb-006:/glass/sfw/cassandra# vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 5  1 277984  50180  12888 2382292   84   67   880   127   10    4 17  4 69 10
 3  1 278000  51536  12856 2381316    4   14  2556    37 8660 10714  7  4 82  8
 2  0 277968  52084  12768 2380156    7    2   891   618 6287 8865  6  3 87  3
 3  0 277872  52020  12764 2379412   24    7  1070    22 5492 10111  8  3 84  4
 0  1 277528  50764  12760 2379176    2   10  1530   696 8773 13661  8  4 83  5