The scenario:

 

Five VMs running the stress test:

/usr/bin/cassandra-stress -n 260000 -replication-factor 3 -d 10.201.3.80

against three nodes of Cassandra. I ran a series of tests, raising the number of inserts (“-n”) until bad things happened.

 

So here is the “bad thing”:

 

[developer@lga-cassdev04 ~]$ /usr/bin/cassandra-stress -n  260000 -replication-factor 3 -d 10.201.3.80

Unable to create stress keyspace: Keyspace names must be case-insensitively unique ("Keyspace1" conflicts with "Keyspace1")

total,interval_op_rate,interval_key_rate,latency/95th/99th,elapsed_time

11308,1130,1130,42.2,45.6,85.9,10…

147663,906,906,43.1,45.2,85.2,121

147663,0,0,43.1,45.2,85.2,131

147663,0,0,43.1,45.2,85.2,141

147663,0,0,43.1,45.2,85.2,151
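For anyone who wants to spot the stall automatically, here is a minimal sketch in Python. It assumes the stress output keeps the seven comma-separated columns shown above (total, interval_op_rate, interval_key_rate, latency, 95th, 99th, elapsed_time); the function name `first_stall` is hypothetical and not part of cassandra-stress.

```python
def first_stall(lines):
    """Return elapsed_time of the first interval with a zero op rate, or None."""
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 7:
            continue  # header row (5 comma-separated names) or malformed line
        try:
            op_rate = int(fields[1])   # interval_op_rate
            elapsed = int(fields[6])   # elapsed_time in seconds
        except ValueError:
            continue  # truncated or non-numeric row
        if op_rate == 0:
            return elapsed
    return None


# Rows copied from the run above.
sample = [
    "total,interval_op_rate,interval_key_rate,latency/95th/99th,elapsed_time",
    "147663,906,906,43.1,45.2,85.2,121",
    "147663,0,0,43.1,45.2,85.2,131",
    "147663,0,0,43.1,45.2,85.2,141",
]
print(first_stall(sample))  # 131
```

Rows that do not parse cleanly are skipped rather than guessed at.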

 

After Cassandra was overwhelmed and could no longer write (i.e., when I saw zeros in the interval numbers), I stopped the test. Unless I stop the Cassandra process (via kill -9) on that node (10.201.3.80 in this case) and restart it, I can’t write to it again, nor do the pending or active numbers in MutationStage change (for days; this example covers only hours). However, after a restart (and a brief period of replaying) the node is up and happy.

 

My query: Should a Cassandra node be able to recover from too many writes on its own? And if it can, what do I need to do to reach such a blissful state?

 

Thank you very much for your time, kindness and expertise!!

 

Regards,

  Eric Marshall

 

 

Details for the curious (including configs, errors and other snippets)

 

Symptoms:

 

developer@lga-casspoc01 ~ $ date

Mon Jul  1 22:58:54 EDT 2013

developer@lga-casspoc01 ~ $ nodetool tpstats

Pool Name                    Active   Pending      Completed   Blocked  All time blocked

ReadStage                         0         0            200         0                 0

RequestResponseStage              0         0        2536230         0                 0

MutationStage                    32      1250        4927756         0                 0

ReadRepairStage                   0         0              0         0                 0

ReplicateOnWriteStage             0         0              0         0                 0

GossipStage                       0         0          84071         0                 0

AntiEntropyStage                  0         0              0         0                 0

MigrationStage                    0         0              0         0                 0

MemtablePostFlusher               0         0             49         0                 0

FlushWriter                       0         0             10         0                 0

MiscStage                         0         0              0         0                 0

commitlog_archiver                0         0              0         0                 0

InternalResponseStage             0         0              0         0                 0

HintedHandoff                     0         0              2         0                 0

 

Message type           Dropped

RANGE_SLICE                  0

READ_REPAIR                  0

BINARY                       0

READ                         0

MUTATION                     0

_TRACE                       0

REQUEST_RESPONSE             0

developer@lga-casspoc01 ~ $ date

Mon Jul  1 22:59:34 EDT 2013

developer@lga-casspoc01 ~ $ nodetool tpstats

Pool Name                    Active   Pending      Completed   Blocked  All time blocked

ReadStage                         0         0            200         0                 0

RequestResponseStage              0         0        2536230         0                 0

MutationStage                    32      1250        4927756         0                 0

ReadRepairStage                   0         0              0         0                 0

ReplicateOnWriteStage             0         0              0         0                 0

GossipStage                       0         0         234130         0                 0

AntiEntropyStage                  0         0              0         0                 0

MigrationStage                    0         0              0         0                 0

MemtablePostFlusher               0         0            112         0                 0

FlushWriter                       0         0             10         0                 0

MiscStage                         0         0              0         0                 0

commitlog_archiver                0         0              0         0                 0

InternalResponseStage             0         0              0         0                 0

HintedHandoff                     0         0              2         0                 0

 

Message type           Dropped

RANGE_SLICE                  0

READ_REPAIR                  0

BINARY                       0

READ                         0

MUTATION                     0

_TRACE                       0

REQUEST_RESPONSE             0

developer@lga-casspoc01 ~ $ date

Tue Jul  2 09:24:53 EDT 2013

developer@lga-casspoc01 ~ $ nodetool tpstats

Pool Name                    Active   Pending      Completed   Blocked  All time blocked

ReadStage                         0         0            200         0                 0

RequestResponseStage              0         0        2536230         0                 0

MutationStage                    32      1250        4927756         0                 0

ReadRepairStage                   0         0              0         0                 0

ReplicateOnWriteStage             0         0              0         0                 0

GossipStage                       0         0         242157         0                 0

AntiEntropyStage                  0         0              0         0                 0

MigrationStage                    0         0              0         0                 0

MemtablePostFlusher               0         0            115         0                 0

FlushWriter                       0         0             10         0                 0

MiscStage                         0         0              0         0                 0

commitlog_archiver                0         0              0         0                 0

InternalResponseStage             0         0              0         0                 0

HintedHandoff                     0         0              2         0                 0

 

Message type           Dropped

RANGE_SLICE                  0

READ_REPAIR                  0

BINARY                       0

READ                         0

MUTATION                     0

_TRACE                       0

REQUEST_RESPONSE             0

developer@lga-casspoc01 ~ $ tail /var/log/cassandra/system.log

DEBUG [MemtablePostFlusher:1] 2013-07-02 09:19:49,012 ColumnFamilyStore.java (line 694) forceFlush requested but everything is clean in batchlog

DEBUG [OptionalTasks:1] 2013-07-02 09:19:49,012 BatchlogManager.java (line 192) Finished replayAllFailedBatches

DEBUG [ScheduledTasks:1] 2013-07-02 09:20:21,267 LoadBroadcaster.java (line 87) Disseminating load info ...

DEBUG [ScheduledTasks:1] 2013-07-02 09:21:21,267 LoadBroadcaster.java (line 87) Disseminating load info ...

DEBUG [ScheduledTasks:1] 2013-07-02 09:22:21,267 LoadBroadcaster.java (line 87) Disseminating load info ...

DEBUG [ScheduledTasks:1] 2013-07-02 09:23:21,267 LoadBroadcaster.java (line 87) Disseminating load info ...

DEBUG [ScheduledTasks:1] 2013-07-02 09:24:21,268 LoadBroadcaster.java (line 87) Disseminating load info ...

DEBUG [ScheduledTasks:1] 2013-07-02 09:25:21,268 LoadBroadcaster.java (line 87) Disseminating load info ...

DEBUG [ScheduledTasks:1] 2013-07-02 09:26:21,268 LoadBroadcaster.java (line 87) Disseminating load info ...

DEBUG [ScheduledTasks:1] 2013-07-02 09:27:21,269 LoadBroadcaster.java (line 87) Disseminating load info ...

developer@lga-casspoc01 ~ $ tail /var/log/cassandra/cassandra.log

DEBUG 17:09:19,145 clearing cached endpoints

DEBUG 17:09:19,145 clearing cached endpoints

DEBUG 17:09:19,146 clearing cached endpoints

DEBUG 17:09:19,147 No bootstrapping, leaving or moving nodes, and no relocating tokens -> empty pending ranges for Keyspace1

DEBUG 17:09:19,147 No bootstrapping, leaving or moving nodes, and no relocating tokens -> empty pending ranges for test1

DEBUG 17:09:19,147 No bootstrapping, leaving or moving nodes, and no relocating tokens -> empty pending ranges for system_auth

DEBUG 17:09:19,148 NORMAL

INFO 17:09:19,148 Startup completed! Now serving reads.

 

developer@lga-casspoc01 ~ $ nodetool status

Datacenter: datacenter1

=======================

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

--  Address      Load       Tokens  Owns   Host ID                               Rack

UN  10.201.3.80  295.42 MB  256     41.8%  3bd60084-cff3-4ae3-9633-8fcc2fe07584  rack1

UN  10.201.3.82  262.65 MB  256     28.2%  86ff5bbf-310f-4b4c-aab6-4ea40b2b3b0c  rack1

UN  10.201.3.81  268.2 MB   256     30.1%  6ce48c31-4051-4f59-8bc4-d0d7890f99cd  rack1
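The three tpstats snapshots above show MutationStage frozen at Active 32, Pending 1250, Completed 4927756 from Monday night to Tuesday morning, while GossipStage keeps advancing. The Active count of 32 equals concurrent_writes: 32 in the config below, which suggests every write thread is occupied. A minimal Python sketch for detecting that pattern from two captured outputs follows; the helper names `parse_tpstats` and `stalled_pools` are hypothetical, and the parsing assumes the six-column pool layout shown here.

```python
def parse_tpstats(text):
    """Map pool name -> (active, pending, completed) from `nodetool tpstats` text."""
    pools = {}
    for line in text.splitlines():
        parts = line.split()
        # Pool rows are a name followed by five integer columns;
        # headers and the "Message type / Dropped" rows do not match.
        if len(parts) == 6 and all(p.isdigit() for p in parts[1:]):
            pools[parts[0]] = tuple(int(p) for p in parts[1:4])
    return pools


def stalled_pools(before, after):
    """Pools with outstanding work whose Completed count did not advance."""
    stuck = []
    for name, (active, pending, completed) in after.items():
        if name not in before:
            continue
        has_work = active > 0 or pending > 0
        if has_work and completed == before[name][2]:
            stuck.append(name)
    return stuck


# Rows copied from the first and last snapshots above.
before = """
ReadStage            0     0      200  0  0
MutationStage       32  1250  4927756  0  0
GossipStage          0     0    84071  0  0
"""
after = """
ReadStage            0     0      200  0  0
MutationStage       32  1250  4927756  0  0
GossipStage          0     0   242157  0  0
"""
print(stalled_pools(parse_tpstats(before), parse_tpstats(after)))  # ['MutationStage']
```

An idle pool such as ReadStage is not flagged, because an unchanged Completed count only matters when something is active or pending.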

 

 

Interesting bits from the logs:

DEBUG [ScheduledTasks:1] 2013-07-01 23:09:21,106 LoadBroadcaster.java (line 87) Disseminating load info ...

DEBUG [OptionalTasks:1] 2013-07-01 23:09:48,957 BatchlogManager.java (line 173) Started replayAllFailedBatches

DEBUG [MemtablePostFlusher:1] 2013-07-01 23:09:48,958 ColumnFamilyStore.java (line 694) forceFlush requested but everything is clean in batchlog

DEBUG [OptionalTasks:1] 2013-07-01 23:09:48,958 BatchlogManager.java (line 192) Finished replayAllFailedBatches

The above repeats every ten minutes.

 

At the time of the failure, I saw these errors repeatedly:

 

DEBUG [Thrift:1341] 2013-07-01 22:58:26,596 StorageProxy.java (line 217) Write timeout org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses. for one (or more) of: [RowMutation(keyspace='Keyspace1', key='313436383233', modifications=[Standard1])]

 

DEBUG [Thrift:1312] 2013-07-01 22:58:26,590 StorageProxy.java (line 217) Write timeout org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses. for one (or more) of: [RowMutation(keyspace='Keyspace1', key='313436383332', modifications=[Standard1])]

 

DEBUG [Thrift:1169] 2013-07-01 22:58:26,685 CustomTThreadPoolServer.java (line 209) Thrift transport error occurred during processing of message.

org.apache.thrift.transport.TTransportException

        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)

        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)

        at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)

        at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)

        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)

        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)

        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)

        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)

        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:22)

        at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199)

        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)

        at java.lang.Thread.run(Thread.java:662)
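The key values in the two write timeouts above look like hex-encoded bytes. Assuming they are plain ASCII (the log does not say so), they can be decoded with a short Python sketch; `decode_key` is a hypothetical helper name.

```python
def decode_key(hex_key):
    """Decode a hex-encoded row key, assuming the bytes are ASCII text."""
    return bytes.fromhex(hex_key).decode("ascii")


# Keys copied from the WriteTimeoutException lines above.
print(decode_key("313436383233"))  # 146823
print(decode_key("313436383332"))  # 146832
```

Both decoded values sit just below the total of 147663 at which the stress run stopped making progress, so these timeouts appear to belong to the last batch of inserts before the stall.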

 

Configs:

rpm -qa | grep cassandra

cassandra12-1.2.5-1.noarch

 

/etc/cassandra/conf/cassandra.yaml

# saved caches

saved_caches_directory: /var/lib/cassandra/saved_caches

 

# commitlog_sync may be either "periodic" or "batch."

# When in batch mode, Cassandra won't ack writes until the commit log

# has been fsynced to disk.  It will wait up to

# commitlog_sync_batch_window_in_ms milliseconds for other writes, before

# performing the sync.

#

# commitlog_sync: batch

# commitlog_sync_batch_window_in_ms: 50

#

# the other option is "periodic" where writes may be acked immediately

# and the CommitLog is simply synced every commitlog_sync_period_in_ms

# milliseconds.

commitlog_sync: periodic

commitlog_sync_period_in_ms: 10000

 

# The size of the individual commitlog file segments.  A commitlog

# segment may be archived, deleted, or recycled once all the data

# in it (potentially from each columnfamily in the system) has been

# flushed to sstables. 

#

# The default size is 32, which is almost always fine, but if you are

# archiving commitlog segments (see commitlog_archiving.properties),

# then you probably want a finer granularity of archiving; 8 or 16 MB

# is reasonable.

commitlog_segment_size_in_mb: 32

 

# any class that implements the SeedProvider interface and has a

# constructor that takes a Map<String, String> of parameters will do.

seed_provider:

    # Addresses of hosts that are deemed contact points.

    # Cassandra nodes use this list of hosts to find each other and learn

    # the topology of the ring.  You must change this if you are running

    # multiple nodes!

    - class_name: org.apache.cassandra.locator.SimpleSeedProvider

      parameters:

          # seeds is actually a comma-delimited list of addresses.

          # Ex: "<ip1>,<ip2>,<ip3>"

          - seeds: "10.201.3.80"

 

# emergency pressure valve: each time heap usage after a full (CMS)

# garbage collection is above this fraction of the max, Cassandra will

# flush the largest memtables. 

#

# Set to 1.0 to disable.  Setting this lower than

# CMSInitiatingOccupancyFraction is not likely to be useful.

#

# RELYING ON THIS AS YOUR PRIMARY TUNING MECHANISM WILL WORK POORLY:

# it is most effective under light to moderate load, or read-heavy

# workloads; under truly massive write load, it will often be too

# little, too late.

flush_largest_memtables_at: 0.75

 

# emergency pressure valve #2: the first time heap usage after a full

# (CMS) garbage collection is above this fraction of the max,

# Cassandra will reduce cache maximum _capacity_ to the given fraction

# of the current _size_.  Should usually be set substantially above

# flush_largest_memtables_at, since that will have less long-term

# impact on the system. 

#

# Set to 1.0 to disable.  Setting this lower than

# CMSInitiatingOccupancyFraction is not likely to be useful.

reduce_cache_sizes_at: 0.85

reduce_cache_capacity_to: 0.6

 

# For workloads with more data than can fit in memory, Cassandra's

# bottleneck will be reads that need to fetch data from

# disk. "concurrent_reads" should be set to (16 * number_of_drives) in

# order to allow the operations to enqueue low enough in the stack

# that the OS and drives can reorder them.

#

# On the other hand, since writes are almost never IO bound, the ideal

# number of "concurrent_writes" is dependent on the number of cores in

# your system; (8 * number_of_cores) is a good rule of thumb.

concurrent_reads: 32

concurrent_writes: 32

 

# Total memory to use for memtables.  Cassandra will flush the largest

# memtable when this much memory is used.

# If omitted, Cassandra will set it to 1/3 of the heap.

# memtable_total_space_in_mb: 2048

 

# Total space to use for commitlogs.  Since commitlog segments are

# mmapped, and hence use up address space, the default size is 32

# on 32-bit JVMs, and 1024 on 64-bit JVMs.

#

# If space gets above this value (it will round up to the next nearest

# segment multiple), Cassandra will flush every dirty CF in the oldest

# segment and remove it.  So a small total commitlog space will tend

# to cause more flush activity on less-active columnfamilies.

# commitlog_total_space_in_mb: 4096

 

# This sets the amount of memtable flush writer threads.  These will

# be blocked by disk io, and each one will hold a memtable in memory

# while blocked. If you have a large heap and many data directories,

# you can increase this value for better flush performance.

# By default this will be set to the amount of data directories defined.

#memtable_flush_writers: 1

 

# the number of full memtables to allow pending flush, that is,

# waiting for a writer thread.  At a minimum, this should be set to

# the maximum number of secondary indexes created on a single CF.

memtable_flush_queue_size: 4

 

# Whether to, when doing sequential writing, fsync() at intervals in

# order to force the operating system to flush the dirty

# buffers. Enable this to avoid sudden dirty buffer flushing from

# impacting read latencies. Almost always a good idea on SSDs; not

# necessarily on platters.

trickle_fsync: false

trickle_fsync_interval_in_kb: 10240

 

# TCP port, for commands and data

storage_port: 7000

 

# SSL port, for encrypted communication.  Unused unless enabled in

# encryption_options

ssl_storage_port: 7001

 

# Address to bind to and tell other Cassandra nodes to connect to. You

# _must_ change this if you want multiple nodes to be able to

# communicate!

#

# Leaving it blank leaves it up to InetAddress.getLocalHost(). This

# will always do the Right Thing _if_ the node is properly configured

# (hostname, name resolution, etc), and the Right Thing is to use the

# address associated with the hostname (it might not be).

#

# Setting this to 0.0.0.0 is always wrong.

listen_address: 10.201.3.80

 

# Address to broadcast to other Cassandra nodes

# Leaving this blank will set it to the same value as listen_address

# broadcast_address: 1.2.3.4

 

# Internode authentication backend, implementing IInternodeAuthenticator;

# used to allow/disallow connections from peer nodes.

# internode_authenticator: org.apache.cassandra.auth.AllowAllInternodeAuthenticator

 

# Whether to start the native transport server.

# Please note that the address on which the native transport is bound is the

# same as the rpc_address. The port however is different and specified below.

start_native_transport: true

# port for the CQL native transport to listen for clients on

native_transport_port: 9042

# The minimum and maximum threads for handling requests when the native

# transport is used. They are similar to rpc_min_threads and rpc_max_threads,

# though the defaults differ slightly.

# native_transport_min_threads: 16

# native_transport_max_threads: 128

 

# Whether to start the thrift rpc server.

start_rpc: true

 

# The address to bind the Thrift RPC service to -- clients connect

# here. Unlike ListenAddress above, you _can_ specify 0.0.0.0 here if

# you want Thrift to listen on all interfaces.

#

# Leaving this blank has the same effect it does for ListenAddress,

# (i.e. it will be based on the configured hostname of the node).

rpc_address: 10.201.3.80

# port for Thrift to listen for clients on

rpc_port: 9160

 

# enable or disable keepalive on rpc connections

rpc_keepalive: true

 

# Cassandra provides three out-of-the-box options for the RPC Server:

#

# sync  -> One thread per thrift connection. For a very large number of clients, memory

#          will be your limiting factor. On a 64 bit JVM, 180KB is the minimum stack size

#          per thread, and that will correspond to your use of virtual memory (but physical memory

#          may be limited depending on use of stack space).

#

# hsha  -> Stands for "half synchronous, half asynchronous." All thrift clients are handled

#          asynchronously using a small number of threads that does not vary with the amount

#          of thrift clients (and thus scales well to many clients). The rpc requests are still

#          synchronous (one thread per active request).

#

# The default is sync because on Windows hsha is about 30% slower.  On Linux,

# sync/hsha performance is about the same, with hsha of course using less memory.

#

# Alternatively, you can provide your own RPC server by providing the fully-qualified class name

# of an o.a.c.t.TServerFactory that can create an instance of it.

rpc_server_type: sync

 

# Uncomment rpc_min|max_thread to set request pool size limits.

#

# Regardless of your choice of RPC server (see above), the number of maximum requests in the

# RPC thread pool dictates how many concurrent requests are possible (but if you are using the sync

# RPC server, it also dictates the number of clients that can be connected at all).

#

# The default is unlimited and thus provides no protection against clients overwhelming the server. You are

# encouraged to set a maximum that makes sense for you in production, but do keep in mind that

# rpc_max_threads represents the maximum number of client requests this server may execute concurrently.

#

# rpc_min_threads: 16

# rpc_max_threads: 2048

 

# uncomment to set socket buffer sizes on rpc connections

# rpc_send_buff_size_in_bytes:

# rpc_recv_buff_size_in_bytes:

 

# Uncomment to set socket buffer size for internode communication

# Note that when setting this, the buffer size is limited by net.core.wmem_max

# and when not setting it it is defined by net.ipv4.tcp_wmem

# See:

# /proc/sys/net/core/wmem_max

# /proc/sys/net/core/rmem_max

# /proc/sys/net/ipv4/tcp_wmem

# /proc/sys/net/ipv4/tcp_wmem

# and: man tcp

# internode_send_buff_size_in_bytes:

# internode_recv_buff_size_in_bytes:

 

# Frame size for thrift (maximum field length).

thrift_framed_transport_size_in_mb: 15

 

# The max length of a thrift message, including all fields and

# internal thrift overhead.

thrift_max_message_length_in_mb: 16

 

# Set to true to have Cassandra create a hard link to each sstable

# flushed or streamed locally in a backups/ subdirectory of the

# keyspace data.  Removing these links is the operator's

# responsibility.

incremental_backups: false

 

# Whether or not to take a snapshot before each compaction.  Be

# careful using this option, since Cassandra won't clean up the

# snapshots for you.  Mostly useful if you're paranoid when there

# is a data format change.

snapshot_before_compaction: false

 

# Whether or not a snapshot is taken of the data before keyspace truncation

# or dropping of column families. The STRONGLY advised default of true

# should be used to provide data safety. If you set this flag to false, you will

# lose data on truncation or drop.

auto_snapshot: true

 

# Add column indexes to a row after its contents reach this size.

# Increase if your column values are large, or if you have a very large

# number of columns.  The competing causes are, Cassandra has to

# deserialize this much of the row to read a single column, so you want

# it to be small - at least if you do many partial-row reads - but all

# the index data is read for each access, so you don't want to generate

# that wastefully either.

column_index_size_in_kb: 64

 

# Size limit for rows being compacted in memory.  Larger rows will spill

# over to disk and use a slower two-pass compaction process.  A message

# will be logged specifying the row key.

in_memory_compaction_limit_in_mb: 64

 

# Number of simultaneous compactions to allow, NOT including

# validation "compactions" for anti-entropy repair.  Simultaneous

# compactions can help preserve read performance in a mixed read/write

# workload, by mitigating the tendency of small sstables to accumulate

# during a single long running compactions. The default is usually

# fine and if you experience problems with compaction running too

# slowly or too fast, you should look at

# compaction_throughput_mb_per_sec first.

#

# concurrent_compactors defaults to the number of cores.

# Uncomment to make compaction mono-threaded, the pre-0.8 default.

#concurrent_compactors: 1

 

# Multi-threaded compaction. When enabled, each compaction will use

# up to one thread per core, plus one thread per sstable being merged.

# This is usually only useful for SSD-based hardware: otherwise,

# your concern is usually to get compaction to do LESS i/o (see:

# compaction_throughput_mb_per_sec), not more.

multithreaded_compaction: false

 

# Throttles compaction to the given total throughput across the entire

# system. The faster you insert data, the faster you need to compact in

# order to keep the sstable count down, but in general, setting this to

# 16 to 32 times the rate you are inserting data is more than sufficient.

# Setting this to 0 disables throttling. Note that this account for all types

# of compaction, including validation compaction.

compaction_throughput_mb_per_sec: 16

 

# Track cached row keys during compaction, and re-cache their new

# positions in the compacted sstable.  Disable if you use really large

# key caches.

compaction_preheat_key_cache: true

 

# Throttles all outbound streaming file transfers on this node to the

# given total throughput in Mbps. This is necessary because Cassandra does

# mostly sequential IO when streaming data during bootstrap or repair, which

# can lead to saturating the network connection and degrading rpc performance.

# When unset, the default is 200 Mbps or 25 MB/s.

# stream_throughput_outbound_megabits_per_sec: 200

 

# How long the coordinator should wait for read operations to complete

read_request_timeout_in_ms: 10000

# How long the coordinator should wait for seq or index scans to complete

range_request_timeout_in_ms: 10000

# How long the coordinator should wait for writes to complete

write_request_timeout_in_ms: 10000

# How long the coordinator should wait for truncates to complete

# (This can be much longer, because unless auto_snapshot is disabled

# we need to flush first so we can snapshot before removing the data.)

truncate_request_timeout_in_ms: 60000

# The default timeout for other, miscellaneous operations

request_timeout_in_ms: 10000

 

# Enable operation timeout information exchange between nodes to accurately

# measure request timeouts. If disabled, Cassandra will assume the request

# was forwarded to the replica instantly by the coordinator

#

# Warning: before enabling this property make sure ntp is installed

# and the times are synchronized between the nodes.

cross_node_timeout: false

 

# Enable socket timeout for streaming operation.

# When a timeout occurs during streaming, streaming is retried from the start

# of the current file. This _can_ involve re-streaming an important amount of

# data, so you should avoid setting the value too low.

# Default value is 0, which never timeout streams.

# streaming_socket_timeout_in_ms: 0

 

# phi value that must be reached for a host to be marked down.

# most users should never need to adjust this.

# phi_convict_threshold: 8

 

# endpoint_snitch -- Set this to a class that implements

# IEndpointSnitch.  The snitch has two functions:

# - it teaches Cassandra enough about your network topology to route

#   requests efficiently

# - it allows Cassandra to spread replicas around your cluster to avoid

#   correlated failures. It does this by grouping machines into

#   "datacenters" and "racks."  Cassandra will do its best not to have

#   more than one replica on the same "rack" (which may not actually

#   be a physical location)

#

# IF YOU CHANGE THE SNITCH AFTER DATA IS INSERTED INTO THE CLUSTER,

# YOU MUST RUN A FULL REPAIR, SINCE THE SNITCH AFFECTS WHERE REPLICAS

# ARE PLACED.

#

# Out of the box, Cassandra provides

#  - SimpleSnitch:

#    Treats Strategy order as proximity. This improves cache locality

#    when disabling read repair, which can further improve throughput.

#    Only appropriate for single-datacenter deployments.

#  - PropertyFileSnitch:

#    Proximity is determined by rack and data center, which are

#    explicitly configured in cassandra-topology.properties.

#  - GossipingPropertyFileSnitch

#    The rack and datacenter for the local node are defined in

#    cassandra-rackdc.properties and propagated to other nodes via gossip.  If

#    cassandra-topology.properties exists, it is used as a fallback, allowing

#    migration from the PropertyFileSnitch.

#  - RackInferringSnitch:

#    Proximity is determined by rack and data center, which are

#    assumed to correspond to the 3rd and 2nd octet of each node's

#    IP address, respectively.  Unless this happens to match your

#    deployment conventions (as it did Facebook's), this is best used

#    as an example of writing a custom Snitch class.

#  - Ec2Snitch:

#    Appropriate for EC2 deployments in a single Region. Loads Region

#    and Availability Zone information from the EC2 API. The Region is

#    treated as the datacenter, and the Availability Zone as the rack.

#    Only private IPs are used, so this will not work across multiple

#    Regions.

#  - Ec2MultiRegionSnitch:

#    Uses public IPs as broadcast_address to allow cross-region

#    connectivity.  (Thus, you should set seed addresses to the public

#    IP as well.) You will need to open the storage_port or

#    ssl_storage_port on the public IP firewall.  (For intra-Region

#    traffic, Cassandra will switch to the private IP after

#    establishing a connection.)

#

# You can use a custom Snitch by setting this to the full class name

# of the snitch, which will be assumed to be on your classpath.

endpoint_snitch: SimpleSnitch

 

# controls how often to perform the more expensive part of host score

# calculation

dynamic_snitch_update_interval_in_ms: 100

# controls how often to reset all host scores, allowing a bad host to

# possibly recover

dynamic_snitch_reset_interval_in_ms: 600000

# if set greater than zero and read_repair_chance is < 1.0, this will allow

# 'pinning' of replicas to hosts in order to increase cache capacity.

# The badness threshold will control how much worse the pinned host has to be

# before the dynamic snitch will prefer other replicas over it.  This is

# expressed as a double which represents a percentage.  Thus, a value of

# 0.2 means Cassandra would continue to prefer the static snitch values

# until the pinned host was 20% worse than the fastest.

dynamic_snitch_badness_threshold: 0.1

 

# request_scheduler -- Set this to a class that implements

# RequestScheduler, which will schedule incoming client requests

# according to the specific policy. This is useful for multi-tenancy

# with a single Cassandra cluster.

# NOTE: This is specifically for requests from the client and does

# not affect inter node communication.

# org.apache.cassandra.scheduler.NoScheduler - No scheduling takes place

# org.apache.cassandra.scheduler.RoundRobinScheduler - Round robin of

# client requests to a node with a separate queue for each

# request_scheduler_id. The scheduler is further customized by

# request_scheduler_options as described below.

request_scheduler: org.apache.cassandra.scheduler.NoScheduler

 

# Scheduler Options vary based on the type of scheduler

# NoScheduler - Has no options

# RoundRobin

#  - throttle_limit -- The throttle_limit is the number of in-flight

#                      requests per client.  Requests beyond

#                      that limit are queued up until

#                      running requests can complete.

#                      The value of 80 here is twice the number of

#                      concurrent_reads + concurrent_writes.

#  - default_weight -- default_weight is optional and allows for

#                      overriding the default which is 1.

#  - weights -- Weights are optional and will default to 1 or the

#               overridden default_weight. The weight translates into how

#               many requests are handled during each turn of the

#               RoundRobin, based on the scheduler id.

#

# request_scheduler_options:

#    throttle_limit: 80

#    default_weight: 5

#    weights:

#      Keyspace1: 1

#      Keyspace2: 5

 

# request_scheduler_id -- An identifier based on which to perform

# the request scheduling. Currently the only valid option is keyspace.

# request_scheduler_id: keyspace

 

# index_interval controls the sampling of entries from the primary

# row index in terms of space versus time.  The larger the interval,

# the smaller and less effective the sampling will be.  In technical

# terms, the interval corresponds to the number of index entries that

# are skipped between taking each sample.  All the sampled entries

# must fit in memory.  Generally, a value between 128 and 512 here

# coupled with a large key cache size on CFs results in the best trade

# offs.  This value is not often changed, however if you have many

# very small rows (many to an OS page), then increasing this will

# often lower memory usage without an impact on performance.

index_interval: 128
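
The space-versus-time trade-off described in the index_interval comment can be illustrated with a small Python sketch. This is a simplified model (keep one entry, skip the rest of each interval), not Cassandra's actual sampling code; the function name and the 1000-entry index are hypothetical.

```python
import math


def sample_index(entries, index_interval=128):
    """Keep every index_interval-th entry, skipping the ones in between."""
    return entries[::index_interval]


# Hypothetical primary row index with 1000 entries.
entries = [f"key{i:04d}" for i in range(1000)]

for interval in (128, 512):
    sampled = sample_index(entries, interval)
    # A larger interval leaves fewer samples in memory: ceil(n / interval).
    assert len(sampled) == math.ceil(len(entries) / interval)
    print(f"index_interval={interval}: {len(sampled)} samples held in memory")
```

In this model a larger interval means fewer samples to keep in memory, but more index entries to scan between two samples on each lookup.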

 

# Enable or disable inter-node encryption

# Default settings are TLS v1, RSA 1024-bit keys (it is imperative that

# users generate their own keys) TLS_RSA_WITH_AES_128_CBC_SHA as the cipher

# suite for authentication, key exchange and encryption of the actual data transfers.

# NOTE: No custom encryption options are enabled at the moment

# The available internode options are : all, none, dc, rack

#

# If set to dc cassandra will encrypt the traffic between the DCs

# If set to rack cassandra will encrypt the traffic between the racks

#

# The passwords used in these options must match the passwords used when generating

# the keystore and truststore.  For instructions on generating these files, see:

# http://download.oracle.com/javase/6/docs/technotes/guides/security/jsse/JSSERefGuide.html#CreateKeystore

#

server_encryption_options:

    internode_encryption: none

    keystore: conf/.keystore

    keystore_password: cassandra

    truststore: conf/.truststore

    truststore_password: cassandra

    # More advanced defaults below:

    # protocol: TLS

    # algorithm: SunX509

    # store_type: JKS

    # cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA]

    # require_client_auth: false

 

# enable or disable client/server encryption.

client_encryption_options:

    enabled: false

    keystore: conf/.keystore

    keystore_password: cassandra

    # require_client_auth: false

    # Set truststore and truststore_password if require_client_auth is true

    # truststore: conf/.truststore

    # truststore_password: cassandra

    # More advanced defaults below:

    # protocol: TLS

    # algorithm: SunX509

    # store_type: JKS

    # cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA]

 

# internode_compression controls whether traffic between nodes is

# compressed.

# can be:  all  - all traffic is compressed

#          dc   - traffic between different datacenters is compressed

#          none - nothing is compressed.

internode_compression: all

 

# Enable or disable tcp_nodelay for inter-dc communication.

# Disabling it will result in larger (but fewer) network packets being sent,

# reducing overhead from the TCP protocol itself, at the cost of increasing

# latency if you block for cross-datacenter responses.

inter_dc_tcp_nodelay: true

# this defines the maximum amount of time a dead host will have hints

# generated.  After it has been dead this long, new hints for it will not be

# created until it has been seen alive and gone down again.

max_hint_window_in_ms: 10800000 # 3 hours

# throttle in KBs per second, per delivery thread

hinted_handoff_throttle_in_kb: 1024

# Number of threads with which to deliver hints;

# Consider increasing this number when you have multi-dc deployments, since

# cross-dc handoff tends to be slower

max_hints_delivery_threads: 2

 

# The following setting populates the page cache on memtable flush and compaction

# WARNING: Enable this setting only when the whole node's data fits in memory.

# Defaults to: false

# populate_io_cache_on_flush: false

 

# Authentication backend, implementing IAuthenticator; used to identify users

# Out of the box, Cassandra provides org.apache.cassandra.auth.{AllowAllAuthenticator,

# PasswordAuthenticator}.

#

# - AllowAllAuthenticator performs no checks - set it to disable authentication.

# - PasswordAuthenticator relies on username/password pairs to authenticate

#   users. It keeps usernames and hashed passwords in system_auth.credentials table.

#   Please increase system_auth keyspace replication factor if you use this authenticator.

authenticator: org.apache.cassandra.auth.AllowAllAuthenticator

 

# Authorization backend, implementing IAuthorizer; used to limit access/provide permissions

# Out of the box, Cassandra provides org.apache.cassandra.auth.{AllowAllAuthorizer,

# CassandraAuthorizer}.

#

# - AllowAllAuthorizer allows any action to any user - set it to disable authorization.

# - CassandraAuthorizer stores permissions in system_auth.permissions table. Please

#   increase system_auth keyspace replication factor if you use this authorizer.

authorizer: org.apache.cassandra.auth.AllowAllAuthorizer

 

# Validity period for permissions cache (fetching permissions can be an

# expensive operation depending on the authorizer, CassandraAuthorizer is

# one example). Defaults to 2000, set to 0 to disable.

# Will be disabled automatically for AllowAllAuthorizer.

permissions_validity_in_ms: 2000

 

# The partitioner is responsible for distributing rows (by key) across

# nodes in the cluster.  Any IPartitioner may be used, including your

# own as long as it is on the classpath.  Out of the box, Cassandra

# provides org.apache.cassandra.dht.{Murmur3Partitioner, RandomPartitioner,

# ByteOrderedPartitioner, OrderPreservingPartitioner (deprecated)}.

#

# - RandomPartitioner distributes rows across the cluster evenly by md5.

#   This is the default prior to 1.2 and is retained for compatibility.

# - Murmur3Partitioner is similar to RandomPartitioner but uses Murmur3_128

#   Hash Function instead of md5.  When in doubt, this is the best option.

# - ByteOrderedPartitioner orders rows lexically by key bytes.  BOP allows

#   scanning rows in key order, but the ordering can generate hot spots

#   for sequential insertion workloads.

# - OrderPreservingPartitioner is an obsolete form of BOP, that stores

#   keys in a less-efficient format and only works with keys that are

#   UTF8-encoded Strings.

# - CollatingOPP collates according to EN,US rules rather than lexical byte

#   ordering.  Use this as an example if you need custom collation.

#

# See http://wiki.apache.org/cassandra/Operations for more on

# partitioners and token selection.

partitioner: org.apache.cassandra.dht.Murmur3Partitioner

 

# Directories where Cassandra should store data on disk.  Cassandra

# will spread data evenly across them, subject to the granularity of

# the configured compaction strategy.

data_file_directories:

#    - /var/lib/cassandra/data

    - /cassandra/data

 

# commit log

commitlog_directory: /var/lib/cassandra/commitlog

 

# policy for data disk failures:

# stop: shut down gossip and Thrift, leaving the node effectively dead, but

#       can still be inspected via JMX.

# best_effort: stop using the failed disk and respond to requests based on

#              remaining available sstables.  This means you WILL see obsolete

#              data at CL.ONE!

# ignore: ignore fatal errors and let requests fail, as in pre-1.2 Cassandra

disk_failure_policy: stop

# Maximum size of the key cache in memory.

#

# Each key cache hit saves 1 seek and each row cache hit saves 2 seeks at the

# minimum, sometimes more. The key cache is fairly tiny for the amount of

# time it saves, so it's worthwhile to use it at large numbers.

# The row cache saves even more time, but must contain the entire row,

# so it is extremely space-intensive. It's best to only use the

# row cache if you have hot rows or static rows.

#

# NOTE: if you reduce the size, you may not get your hottest keys loaded on startup.

#

# Default value is empty to make it "auto" (min(5% of Heap (in MB), 100MB)). Set to 0 to disable key cache.

key_cache_size_in_mb:
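
Since key_cache_size_in_mb is left empty here, the "auto" rule from the comment applies: min(5% of heap in MB, 100 MB). A minimal Python sketch of that rule follows; the function name, the integer rounding, and the heap sizes are assumptions for illustration, not values taken from this cluster.

```python
def auto_key_cache_size_mb(heap_mb: int) -> int:
    """Return min(5% of heap, 100 MB), as stated in the cassandra.yaml comment."""
    five_percent = heap_mb * 5 // 100  # whole MB; the rounding is an assumption
    return min(five_percent, 100)


# Hypothetical heap sizes, in MB.
for heap_mb in (1024, 4096, 8192):
    print(f"{heap_mb} MB heap -> {auto_key_cache_size_mb(heap_mb)} MB key cache")
```

Under this rule any heap of 2000 MB or more hits the 100 MB cap.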

 

# Duration in seconds after which Cassandra should

# save the key cache. Caches are saved to saved_caches_directory as

# specified in this configuration file.

#

# Saved caches greatly improve cold-start speeds, and are relatively cheap in

# terms of I/O for the key cache. Row cache saving is much more expensive and

# has limited use.

#

# Default is 14400 or 4 hours.

key_cache_save_period: 14400

 

# Number of keys from the key cache to save

# Disabled by default, meaning all keys are going to be saved

# key_cache_keys_to_save: 100

 

# Maximum size of the row cache in memory.

# NOTE: if you reduce the size, you may not get your hottest keys loaded on startup.

#

# Default value is 0, to disable row caching.

row_cache_size_in_mb: 0

 

# Duration in seconds after which Cassandra should

# save the row cache. Caches are saved to saved_caches_directory as specified

# in this configuration file.

#

# Saved caches greatly improve cold-start speeds, and are relatively cheap in

# terms of I/O for the key cache. Row cache saving is much more expensive and

# has limited use.

#

# Default is 0 to disable saving the row cache.

row_cache_save_period: 0

 

# Number of keys from the row cache to save

# Disabled by default, meaning all keys are going to be saved

# row_cache_keys_to_save: 100

 

# The provider for the row cache to use.

#

# Supported values are: ConcurrentLinkedHashCacheProvider, SerializingCacheProvider

#

# SerializingCacheProvider serialises the contents of the row and stores

# it in native memory, i.e., off the JVM Heap. Serialized rows take

# significantly less memory than "live" rows in the JVM, so you can cache

# more rows in a given memory footprint.  And storing the cache off-heap

# means you can use smaller heap sizes, reducing the impact of GC pauses.

# Note however that when a row is requested from the row cache, it must be

# deserialized into the heap for use.

#

# It is also valid to specify the fully-qualified class name to a class

# that implements org.apache.cassandra.cache.IRowCacheProvider.

#

# Defaults to SerializingCacheProvider

row_cache_provider: SerializingCacheProvider

 

# saved caches

saved_caches_directory: /var/lib/cassandra/saved_caches

 

# commitlog_sync may be either "periodic" or "batch."

# When in batch mode, Cassandra won't ack writes until the commit log

# has been fsynced to disk.  It will wait up to

# commitlog_sync_batch_window_in_ms milliseconds for other writes, before

# performing the sync.

#

# commitlog_sync: batch

# commitlog_sync_batch_window_in_ms: 50

#

# the other option is "periodic" where writes may be acked immediately

# and the CommitLog is simply synced every commitlog_sync_period_in_ms

# milliseconds.

commitlog_sync: periodic

commitlog_sync_period_in_ms: 10000

 

# The size of the individual commitlog file segments.  A commitlog

# segment may be archived, deleted, or recycled once all the data

# in it (potentially from each columnfamily in the system) has been

# flushed to sstables. 

#

# The default size is 32, which is almost always fine, but if you are

# archiving commitlog segments (see commitlog_archiving.properties),

# then you probably want a finer granularity of archiving; 8 or 16 MB

# is reasonable.

commitlog_segment_size_in_mb: 32

 

# any class that implements the SeedProvider interface and has a

# constructor that takes a Map<String, String> of parameters will do.

seed_provider:

    # Addresses of hosts that are deemed contact points.

    # Cassandra nodes use this list of hosts to find each other and learn

    # the topology of the ring.  You must change this if you are running

    # multiple nodes!

    - class_name: org.apache.cassandra.locator.SimpleSeedProvider

      parameters:

          # seeds is actually a comma-delimited list of addresses.

          # Ex: "<ip1>,<ip2>,<ip3>"

          - seeds: "10.201.3.80"

 

# emergency pressure valve: each time heap usage after a full (CMS)

# garbage collection is above this fraction of the max, Cassandra will

# flush the largest memtables. 

#

# Set to 1.0 to disable.  Setting this lower than

# CMSInitiatingOccupancyFraction is not likely to be useful.

#

# RELYING ON THIS AS YOUR PRIMARY TUNING MECHANISM WILL WORK POORLY:

# it is most effective under light to moderate load, or read-heavy

# workloads; under truly massive write load, it will often be too

# little, too late.

flush_largest_memtables_at: 0.75

 

# emergency pressure valve #2: the first time heap usage after a full

# (CMS) garbage collection is above this fraction of the max,

# Cassandra will reduce cache maximum _capacity_ to the given fraction

# of the current _size_.  Should usually be set substantially above

# flush_largest_memtables_at, since that will have less long-term

# impact on the system. 

#

# Set to 1.0 to disable.  Setting this lower than

# CMSInitiatingOccupancyFraction is not likely to be useful.

reduce_cache_sizes_at: 0.85

reduce_cache_capacity_to: 0.6

 

# For workloads with more data than can fit in memory, Cassandra's

# bottleneck will be reads that need to fetch data from

# disk. "concurrent_reads" should be set to (16 * number_of_drives) in

# order to allow the operations to enqueue low enough in the stack

# that the OS and drives can reorder them.

#

# On the other hand, since writes are almost never IO bound, the ideal

# number of "concurrent_writes" is dependent on the number of cores in

# your system; (8 * number_of_cores) is a good rule of thumb.

concurrent_reads: 32

concurrent_writes: 32
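
The two rules of thumb in the comment above (16 * number_of_drives for reads, 8 * number_of_cores for writes) can be worked through in Python. The function name and the hardware figures are hypothetical; the actual drive and core counts of these nodes are not stated in this post.

```python
def suggested_concurrency(number_of_drives: int, number_of_cores: int) -> dict:
    """Apply the cassandra.yaml rules of thumb for the read and write pools."""
    return {
        "concurrent_reads": 16 * number_of_drives,
        "concurrent_writes": 8 * number_of_cores,
    }


# Hypothetical node: 2 data drives and 4 cores.
print(suggested_concurrency(number_of_drives=2, number_of_cores=4))
```

With concurrent_writes at 32, a MutationStage "Active" count of 32 in nodetool tpstats means every write thread is busy.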

 

# Total memory to use for memtables.  Cassandra will flush the largest

# memtable when this much memory is used.

# If omitted, Cassandra will set it to 1/3 of the heap.

# memtable_total_space_in_mb: 2048

 

# Total space to use for commitlogs.  Since commitlog segments are

# mmapped, and hence use up address space, the default size is 32

# on 32-bit JVMs, and 1024 on 64-bit JVMs.

#

# If space gets above this value (it will round up to the next nearest

# segment multiple), Cassandra will flush every dirty CF in the oldest

# segment and remove it.  So a small total commitlog space will tend

# to cause more flush activity on less-active columnfamilies.

# commitlog_total_space_in_mb: 4096
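
The comment says the limit "will round up to the next nearest segment multiple". Reading that as a plain ceiling to a whole number of commitlog segments (an assumption based on the wording, not on Cassandra's source), the effective limit can be sketched in Python. The function name is hypothetical; 32 is the commitlog_segment_size_in_mb set earlier in this file.

```python
import math


def effective_commitlog_limit_mb(total_space_mb: int, segment_size_mb: int = 32) -> int:
    """Round the configured total up to a whole number of commitlog segments."""
    segments = math.ceil(total_space_mb / segment_size_mb)
    return segments * segment_size_mb


# 4096 MB is already a multiple of 32 MB (128 segments); 1000 MB is not.
for total_mb in (4096, 1000):
    print(f"{total_mb} MB configured -> {effective_commitlog_limit_mb(total_mb)} MB effective")
```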

 

# This sets the number of memtable flush writer threads.  These will

# be blocked by disk io, and each one will hold a memtable in memory

# while blocked. If you have a large heap and many data directories,

# you can increase this value for better flush performance.

# By default this will be set to the number of data directories defined.

#memtable_flush_writers: 1

 

# the number of full memtables to allow pending flush, that is,

# waiting for a writer thread.  At a minimum, this should be set to

# the maximum number of secondary indexes created on a single CF.

memtable_flush_queue_size: 4

 

# Whether to, when doing sequential writing, fsync() at intervals in

# order to force the operating system to flush the dirty

# buffers. Enable this to avoid sudden dirty buffer flushing from

# impacting read latencies. Almost always a good idea on SSDs; not

# necessarily on platters.

trickle_fsync: false

trickle_fsync_interval_in_kb: 10240

 

# TCP port, for commands and data

storage_port: 7000

 

# SSL port, for encrypted communication.  Unused unless enabled in

# encryption_options

ssl_storage_port: 7001

 

# Address to bind to and tell other Cassandra nodes to connect to. You

# _must_ change this if you want multiple nodes to be able to

# communicate!

#

# Leaving it blank leaves it up to InetAddress.getLocalHost(). This

# will always do the Right Thing _if_ the node is properly configured

# (hostname, name resolution, etc), and the Right Thing is to use the

# address associated with the hostname (it might not be).

#

# Setting this to 0.0.0.0 is always wrong.

listen_address: 10.201.3.80

 

# Address to broadcast to other Cassandra nodes

# Leaving this blank will set it to the same value as listen_address

# broadcast_address: 1.2.3.4

 

# Internode authentication backend, implementing IInternodeAuthenticator;

# used to allow/disallow connections from peer nodes.

# internode_authenticator: org.apache.cassandra.auth.AllowAllInternodeAuthenticator

 

# Whether to start the native transport server.

# Please note that the address on which the native transport is bound is the

# same as the rpc_address. The port however is different and specified below.

start_native_transport: true

# port for the CQL native transport to listen for clients on

native_transport_port: 9042

# The minimum and maximum threads for handling requests when the native

# transport is used. They are similar to rpc_min_threads and rpc_max_threads,

# though the defaults differ slightly.

# native_transport_min_threads: 16

# native_transport_max_threads: 128

 

# Whether to start the thrift rpc server.

start_rpc: true

 

# The address to bind the Thrift RPC service to -- clients connect

# here. Unlike ListenAddress above, you _can_ specify 0.0.0.0 here if

# you want Thrift to listen on all interfaces.

#

# Leaving this blank has the same effect it does for ListenAddress,

# (i.e. it will be based on the configured hostname of the node).

rpc_address: 10.201.3.80

# port for Thrift to listen for clients on

rpc_port: 9160

 

# enable or disable keepalive on rpc connections

rpc_keepalive: true

 

# Cassandra provides three out-of-the-box options for the RPC Server:

#

# sync  -> One thread per thrift connection. For a very large number of clients, memory

#          will be your limiting factor. On a 64 bit JVM, 180KB is the minimum stack size

#          per thread, and that will correspond to your use of virtual memory (but physical memory

#          may be limited depending on use of stack space).

#

# hsha  -> Stands for "half synchronous, half asynchronous." All thrift clients are handled

#          asynchronously using a small number of threads that does not vary with the amount

#          of thrift clients (and thus scales well to many clients). The rpc requests are still

#          synchronous (one thread per active request).

#

# The default is sync because on Windows hsha is about 30% slower.  On Linux,

# sync/hsha performance is about the same, with hsha of course using less memory.

#

# Alternatively, you can provide your own RPC server by providing the fully-qualified class name

# of an o.a.c.t.TServerFactory that can create an instance of it.

rpc_server_type: sync
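
With rpc_server_type set to sync, each Thrift connection gets its own thread, and the comment gives 180KB as the minimum stack size per thread on a 64-bit JVM. A rough Python sketch of the resulting lower bound on stack virtual memory follows; the function name and the connection counts are hypothetical, and real usage depends on the JVM's configured stack size.

```python
MIN_STACK_KB = 180  # minimum per-thread stack size quoted in the comment above


def min_stack_virtual_memory_mb(client_connections: int) -> float:
    """Lower bound on stack virtual memory for one thread per connection."""
    return client_connections * MIN_STACK_KB / 1024


# Hypothetical connection counts; 2048 matches the commented-out rpc_max_threads.
for connections in (100, 2048):
    print(f"{connections} connections -> at least {min_stack_virtual_memory_mb(connections):.1f} MB")
```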

 

# Uncomment rpc_min|max_thread to set request pool size limits.

#

# Regardless of your choice of RPC server (see above), the number of maximum requests in the

# RPC thread pool dictates how many concurrent requests are possible (but if you are using the sync

# RPC server, it also dictates the number of clients that can be connected at all).

#

# The default is unlimited and thus provides no protection against clients overwhelming the server. You are

# encouraged to set a maximum that makes sense for you in production, but do keep in mind that

# rpc_max_threads represents the maximum number of client requests this server may execute concurrently.

#

# rpc_min_threads: 16

# rpc_max_threads: 2048

 

# uncomment to set socket buffer sizes on rpc connections

# rpc_send_buff_size_in_bytes:

# rpc_recv_buff_size_in_bytes:

 

# Uncomment to set socket buffer size for internode communication

# Note that when setting this, the buffer size is limited by net.core.wmem_max

# and when not set, it is defined by net.ipv4.tcp_wmem

# See:

# /proc/sys/net/core/wmem_max

# /proc/sys/net/core/rmem_max

# /proc/sys/net/ipv4/tcp_wmem

# /proc/sys/net/ipv4/tcp_rmem

# and: man tcp

# internode_send_buff_size_in_bytes:

# internode_recv_buff_size_in_bytes:

 

# Frame size for thrift (maximum field length).

thrift_framed_transport_size_in_mb: 15

 

# The max length of a thrift message, including all fields and

# internal thrift overhead.

thrift_max_message_length_in_mb: 16

 

# Set to true to have Cassandra create a hard link to each sstable

# flushed or streamed locally in a backups/ subdirectory of the

# keyspace data.  Removing these links is the operator's

# responsibility.

incremental_backups: false

 

# Whether or not to take a snapshot before each compaction.  Be

# careful using this option, since Cassandra won't clean up the

# snapshots for you.  Mostly useful if you're paranoid when there

# is a data format change.

snapshot_before_compaction: false

 

# Whether or not a snapshot is taken of the data before keyspace truncation

# or dropping of column families. The STRONGLY advised default of true

# should be used to provide data safety. If you set this flag to false, you will

# lose data on truncation or drop.

auto_snapshot: true

 

# Add column indexes to a row after its contents reach this size.

# Increase if your column values are large, or if you have a very large

# number of columns.  The competing causes are, Cassandra has to

# deserialize this much of the row to read a single column, so you want

# it to be small - at least if you do many partial-row reads - but all

# the index data is read for each access, so you don't want to generate

# that wastefully either.

column_index_size_in_kb: 64

 

# Size limit for rows being compacted in memory.  Larger rows will spill

# over to disk and use a slower two-pass compaction process.  A message

# will be logged specifying the row key.

in_memory_compaction_limit_in_mb: 64

 

# Number of simultaneous compactions to allow, NOT including

# validation "compactions" for anti-entropy repair.  Simultaneous

# compactions can help preserve read performance in a mixed read/write

# workload, by mitigating the tendency of small sstables to accumulate

# during a single long-running compaction. The default is usually

# fine and if you experience problems with compaction running too

# slowly or too fast, you should look at

# compaction_throughput_mb_per_sec first.

#

# concurrent_compactors defaults to the number of cores.

# Uncomment to make compaction mono-threaded, the pre-0.8 default.

#concurrent_compactors: 1

 

# Multi-threaded compaction. When enabled, each compaction will use

# up to one thread per core, plus one thread per sstable being merged.

# This is usually only useful for SSD-based hardware: otherwise,

# your concern is usually to get compaction to do LESS i/o (see:

# compaction_throughput_mb_per_sec), not more.

multithreaded_compaction: false

 

# Throttles compaction to the given total throughput across the entire

# system. The faster you insert data, the faster you need to compact in

# order to keep the sstable count down, but in general, setting this to

# 16 to 32 times the rate you are inserting data is more than sufficient.

# Setting this to 0 disables throttling. Note that this accounts for all types

# of compaction, including validation compaction.

compaction_throughput_mb_per_sec: 16
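
The "16 to 32 times the rate you are inserting data" guideline can be turned into a quick calculation. This Python sketch uses a hypothetical function name and a hypothetical insert rate; the real insert rate of the stress test in MB/s is not given in this post.

```python
def compaction_throughput_range(insert_rate_mb_per_sec: float) -> tuple:
    """Return the (low, high) throttle suggested by the 16x to 32x guideline."""
    return (16 * insert_rate_mb_per_sec, 32 * insert_rate_mb_per_sec)


# Hypothetical sustained insert rate of 0.5 MB/s.
low, high = compaction_throughput_range(0.5)
print(f"suggested compaction_throughput_mb_per_sec: {low} to {high}")
```

Read the other way, the configured value of 16 corresponds to a sustained insert rate of roughly 0.5 to 1 MB/s under this guideline.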

 

# Track cached row keys during compaction, and re-cache their new

# positions in the compacted sstable.  Disable if you use really large

# key caches.

compaction_preheat_key_cache: true

 

# Throttles all outbound streaming file transfers on this node to the

# given total throughput in Mbps. This is necessary because Cassandra does

# mostly sequential IO when streaming data during bootstrap or repair, which

# can lead to saturating the network connection and degrading rpc performance.

# When unset, the default is 200 Mbps or 25 MB/s.

# stream_throughput_outbound_megabits_per_sec: 200
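
The comment equates 200 Mbps with 25 MB/s; the conversion is a division by 8 bits per byte. A short Python check (the function name is illustrative):

```python
def megabits_to_megabytes_per_sec(megabits_per_sec: float) -> float:
    """Convert a rate in megabits per second to megabytes per second."""
    return megabits_per_sec / 8


print(megabits_to_megabytes_per_sec(200))  # the default streaming throttle
```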

 

# How long the coordinator should wait for read operations to complete

read_request_timeout_in_ms: 10000

# How long the coordinator should wait for seq or index scans to complete

range_request_timeout_in_ms: 10000

# How long the coordinator should wait for writes to complete

write_request_timeout_in_ms: 10000

# How long the coordinator should wait for truncates to complete

# (This can be much longer, because unless auto_snapshot is disabled

# we need to flush first so we can snapshot before removing the data.)

truncate_request_timeout_in_ms: 60000

# The default timeout for other, miscellaneous operations

request_timeout_in_ms: 10000

 

# Enable operation timeout information exchange between nodes to accurately

# measure request timeouts. If disabled, Cassandra will assume the request

# was forwarded to the replica instantly by the coordinator

#

# Warning: before enabling this property make sure ntp is installed

# and the times are synchronized between the nodes.

cross_node_timeout: false

 

# Enable socket timeout for streaming operation.

# When a timeout occurs during streaming, streaming is retried from the start

# of the current file. This _can_ involve re-streaming an important amount of

# data, so you should avoid setting the value too low.

# Default value is 0, which never times out streams.

# streaming_socket_timeout_in_ms: 0

 

# phi value that must be reached for a host to be marked down.

# most users should never need to adjust this.

# phi_convict_threshold: 8

 

# endpoint_snitch -- Set this to a class that implements

# IEndpointSnitch.  The snitch has two functions:

# - it teaches Cassandra enough about your network topology to route

#   requests efficiently

# - it allows Cassandra to spread replicas around your cluster to avoid

#   correlated failures. It does this by grouping machines into

#   "datacenters" and "racks."  Cassandra will do its best not to have

#   more than one replica on the same "rack" (which may not actually

#   be a physical location)

#

# IF YOU CHANGE THE SNITCH AFTER DATA IS INSERTED INTO THE CLUSTER,

# YOU MUST RUN A FULL REPAIR, SINCE THE SNITCH AFFECTS WHERE REPLICAS

# ARE PLACED.

#

# Out of the box, Cassandra provides

#  - SimpleSnitch:

#    Treats Strategy order as proximity. This improves cache locality

#    when disabling read repair, which can further improve throughput.

#    Only appropriate for single-datacenter deployments.

#  - PropertyFileSnitch:

#    Proximity is determined by rack and data center, which are

#    explicitly configured in cassandra-topology.properties.

#  - GossipingPropertyFileSnitch

#    The rack and datacenter for the local node are defined in

#    cassandra-rackdc.properties and propagated to other nodes via gossip.  If

#    cassandra-topology.properties exists, it is used as a fallback, allowing

#    migration from the PropertyFileSnitch.

#  - RackInferringSnitch:

#    Proximity is determined by rack and data center, which are

#    assumed to correspond to the 3rd and 2nd octet of each node's

#    IP address, respectively.  Unless this happens to match your

#    deployment conventions (as it did Facebook's), this is best used

#    as an example of writing a custom Snitch class.

#  - Ec2Snitch:

#    Appropriate for EC2 deployments in a single Region. Loads Region

#    and Availability Zone information from the EC2 API. The Region is

#    treated as the datacenter, and the Availability Zone as the rack.

#    Only private IPs are used, so this will not work across multiple

#    Regions.

#  - Ec2MultiRegionSnitch:

#    Uses public IPs as broadcast_address to allow cross-region

#    connectivity.  (Thus, you should set seed addresses to the public

#    IP as well.) You will need to open the storage_port or

#    ssl_storage_port on the public IP firewall.  (For intra-Region

#    traffic, Cassandra will switch to the private IP after

#    establishing a connection.)

#

# You can use a custom Snitch by setting this to the full class name

# of the snitch, which will be assumed to be on your classpath.

endpoint_snitch: SimpleSnitch
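
This cluster uses SimpleSnitch, so the following is only an illustration of the RackInferringSnitch rule described above, where the 2nd octet of a node's IP address is taken as the data center and the 3rd octet as the rack. The Python function is a hypothetical sketch of that rule, not Cassandra's implementation.

```python
def infer_dc_and_rack(ip_address: str) -> tuple:
    """Return (datacenter, rack) from the 2nd and 3rd octets of an IPv4 address."""
    octets = ip_address.split(".")
    if len(octets) != 4:
        raise ValueError(f"not a dotted-quad IPv4 address: {ip_address!r}")
    return octets[1], octets[2]


# The seed node from this configuration.
datacenter, rack = infer_dc_and_rack("10.201.3.80")
print(f"datacenter={datacenter} rack={rack}")
```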

 

# controls how often to perform the more expensive part of host score

# calculation

dynamic_snitch_update_interval_in_ms: 100

# controls how often to reset all host scores, allowing a bad host to

# possibly recover

dynamic_snitch_reset_interval_in_ms: 600000

# if set greater than zero and read_repair_chance is < 1.0, this will allow

# 'pinning' of replicas to hosts in order to increase cache capacity.

# The badness threshold will control how much worse the pinned host has to be

# before the dynamic snitch will prefer other replicas over it.  This is

# expressed as a double which represents a percentage.  Thus, a value of

# 0.2 means Cassandra would continue to prefer the static snitch values

# until the pinned host was 20% worse than the fastest.

dynamic_snitch_badness_threshold: 0.1
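
The badness threshold comment can be read as a simple comparison: keep preferring the pinned host until it is more than the given fraction worse than the fastest host. The Python sketch below is a deliberate simplification under that reading. It assumes that a lower score is better; the function name and the scores are hypothetical, and the dynamic snitch's real scoring is more involved.

```python
def prefers_pinned_host(pinned_score: float, fastest_score: float,
                        badness_threshold: float = 0.1) -> bool:
    """True while the pinned host is within the threshold of the fastest host."""
    return pinned_score <= fastest_score * (1.0 + badness_threshold)


# With the configured threshold of 0.1, a host 5% worse stays pinned,
# while a host 20% worse does not.
print(prefers_pinned_host(1.05, 1.0))
print(prefers_pinned_host(1.20, 1.0))
```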

 

# request_scheduler -- Set this to a class that implements

# RequestScheduler, which will schedule incoming client requests

# according to the specific policy. This is useful for multi-tenancy

# with a single Cassandra cluster.

# NOTE: This is specifically for requests from the client and does

# not affect inter node communication.

# org.apache.cassandra.scheduler.NoScheduler - No scheduling takes place

# org.apache.cassandra.scheduler.RoundRobinScheduler - Round robin of

# client requests to a node with a separate queue for each

# request_scheduler_id. The scheduler is further customized by

# request_scheduler_options as described below.

request_scheduler: org.apache.cassandra.scheduler.NoScheduler

 

# Scheduler Options vary based on the type of scheduler

# NoScheduler - Has no options

# RoundRobin

#  - throttle_limit -- The throttle_limit is the number of in-flight

#                      requests per client.  Requests beyond

#                      that limit are queued up until

#                      running requests can complete.

#                      The value of 80 here is twice the number of

#                      concurrent_reads + concurrent_writes.

#  - default_weight -- default_weight is optional and allows for

#                      overriding the default which is 1.

#  - weights -- Weights are optional and will default to 1 or the

#               overridden default_weight. The weight translates into how

#               many requests are handled during each turn of the

#               RoundRobin, based on the scheduler id.

#

# request_scheduler_options:

#    throttle_limit: 80

#    default_weight: 5

#    weights:

#      Keyspace1: 1

#      Keyspace2: 5

 

# request_scheduler_id -- An identifier based on which to perform

# the request scheduling. Currently the only valid option is keyspace.

# request_scheduler_id: keyspace

 

# index_interval controls the sampling of entries from the primary

# row index in terms of space versus time.  The larger the interval,

# the smaller and less effective the sampling will be.  In technical

# terms, the interval corresponds to the number of index entries that

# are skipped between taking each sample.  All the sampled entries

# must fit in memory.  Generally, a value between 128 and 512 here

# coupled with a large key cache size on CFs results in the best trade

# offs.  This value is not often changed, however if you have many

# very small rows (many to an OS page), then increasing this will

# often lower memory usage without an impact on performance.

index_interval: 128

 

# Enable or disable inter-node encryption

# Default settings are TLS v1, RSA 1024-bit keys (it is imperative that

# users generate their own keys) TLS_RSA_WITH_AES_128_CBC_SHA as the cipher

# suite for authentication, key exchange and encryption of the actual data transfers.

# NOTE: No custom encryption options are enabled at the moment

# The available internode options are : all, none, dc, rack

#

# If set to dc cassandra will encrypt the traffic between the DCs

# If set to rack cassandra will encrypt the traffic between the racks

#

# The passwords used in these options must match the passwords used when generating

# the keystore and truststore.  For instructions on generating these files, see:

# http://download.oracle.com/javase/6/docs/technotes/guides/security/jsse/JSSERefGuide.html#CreateKeystore

#

server_encryption_options:

    internode_encryption: none

    keystore: conf/.keystore

    keystore_password: cassandra

    truststore: conf/.truststore

    truststore_password: cassandra

    # More advanced defaults below:

    # protocol: TLS

    # algorithm: SunX509

    # store_type: JKS

    # cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA]

    # require_client_auth: false

 

# enable or disable client/server encryption.

client_encryption_options:

    enabled: false

    keystore: conf/.keystore

    keystore_password: cassandra

    # require_client_auth: false

    # Set truststore and truststore_password if require_client_auth is true

    # truststore: conf/.truststore

    # truststore_password: cassandra

    # More advanced defaults below:

    # protocol: TLS

    # algorithm: SunX509

    # store_type: JKS

    # cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA]

 

# internode_compression controls whether traffic between nodes is

# compressed.

# can be:  all  - all traffic is compressed

#          dc   - traffic between different datacenters is compressed

#          none - nothing is compressed.

internode_compression: all

 

# Enable or disable tcp_nodelay for inter-dc communication.

# Disabling it will result in larger (but fewer) network packets being sent,

# reducing overhead from the TCP protocol itself, at the cost of increasing

# latency if you block for cross-datacenter responses.

inter_dc_tcp_nodelay: true
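
A rough sketch of what the index_interval comment in the cassandra.yaml above implies for memory, assuming one sample is kept for every index_interval index entries. The helper name sampled_entries is made up for this illustration, and the 260000 figure is borrowed from the stress run only as an example; real per-SSTable row counts will differ.

```shell
#!/bin/sh
# Hypothetical helper: approximate number of index samples held in memory
# for a given number of index entries and a given index_interval.
sampled_entries() {
    expr "$1" / "$2"
}

# Example row count only; not measured from the cluster.
rows=260000
for interval in 128 256 512; do
    echo "index_interval=$interval -> about `sampled_entries $rows $interval` sampled entries"
done
```

Doubling the interval roughly halves the number of samples that must fit in memory, which is the space-versus-time trade-off the comment describes.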

 

cat /etc/cassandra/conf/cassandra-env.sh

# Licensed to the Apache Software Foundation (ASF) under one

# or more contributor license agreements.  See the NOTICE file

# distributed with this work for additional information

# regarding copyright ownership.  The ASF licenses this file

# to you under the Apache License, Version 2.0 (the

# "License"); you may not use this file except in compliance

# with the License.  You may obtain a copy of the License at

#

#     http://www.apache.org/licenses/LICENSE-2.0

#

# Unless required by applicable law or agreed to in writing, software

# distributed under the License is distributed on an "AS IS" BASIS,

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# See the License for the specific language governing permissions and

# limitations under the License.

 

calculate_heap_sizes()

{

    case "`uname`" in

        Linux)

            system_memory_in_mb=`free -m | awk '/Mem:/ {print $2}'`

            system_cpu_cores=`egrep -c 'processor([[:space:]]+):.*' /proc/cpuinfo`

        ;;

        FreeBSD)

            system_memory_in_bytes=`sysctl hw.physmem | awk '{print $2}'`

            system_memory_in_mb=`expr $system_memory_in_bytes / 1024 / 1024`

            system_cpu_cores=`sysctl hw.ncpu | awk '{print $2}'`

        ;;

        SunOS)

            system_memory_in_mb=`prtconf | awk '/Memory size:/ {print $3}'`

            system_cpu_cores=`psrinfo | wc -l`

        ;;

        Darwin)

            system_memory_in_bytes=`sysctl hw.memsize | awk '{print $2}'`

            system_memory_in_mb=`expr $system_memory_in_bytes / 1024 / 1024`

            system_cpu_cores=`sysctl hw.ncpu | awk '{print $2}'`

        ;;

        *)

            # assume reasonable defaults for e.g. a modern desktop or

            # cheap server

            system_memory_in_mb="2048"

            system_cpu_cores="2"

        ;;

    esac

 

    # some systems like the raspberry pi don't report cores, use at least 1

    if [ "$system_cpu_cores" -lt "1" ]

    then

        system_cpu_cores="1"

    fi

 

    # set max heap size based on the following

    # max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB))

    # calculate 1/2 ram and cap to 1024MB

    # calculate 1/4 ram and cap to 8192MB

    # pick the max

    half_system_memory_in_mb=`expr $system_memory_in_mb / 2`

    quarter_system_memory_in_mb=`expr $half_system_memory_in_mb / 2`

    if [ "$half_system_memory_in_mb" -gt "1024" ]

    then

        half_system_memory_in_mb="1024"

    fi

    if [ "$quarter_system_memory_in_mb" -gt "8192" ]

    then

        quarter_system_memory_in_mb="8192"

    fi

    if [ "$half_system_memory_in_mb" -gt "$quarter_system_memory_in_mb" ]

    then

        max_heap_size_in_mb="$half_system_memory_in_mb"

    else

        max_heap_size_in_mb="$quarter_system_memory_in_mb"

    fi

    MAX_HEAP_SIZE="${max_heap_size_in_mb}M"

 

    # Young gen: min(max_sensible_per_modern_cpu_core * num_cores, 1/4 * heap size)

    max_sensible_yg_per_core_in_mb="100"

    max_sensible_yg_in_mb=`expr $max_sensible_yg_per_core_in_mb "*" $system_cpu_cores`

 

    desired_yg_in_mb=`expr $max_heap_size_in_mb / 4`

 

    if [ "$desired_yg_in_mb" -gt "$max_sensible_yg_in_mb" ]

    then

        HEAP_NEWSIZE="${max_sensible_yg_in_mb}M"

    else

        HEAP_NEWSIZE="${desired_yg_in_mb}M"

    fi

}

 

# Determine the sort of JVM we'll be running on.

 

java_ver_output=`"${JAVA:-java}" -version 2>&1`

 

jvmver=`echo "$java_ver_output" | awk -F'"' 'NR==1 {print $2}'`

JVM_VERSION=${jvmver%_*}

JVM_PATCH_VERSION=${jvmver#*_}

 

jvm=`echo "$java_ver_output" | awk 'NR==2 {print $1}'`

case "$jvm" in

    OpenJDK)

        JVM_VENDOR=OpenJDK

        # this will be "64-Bit" or "32-Bit"

        JVM_ARCH=`echo "$java_ver_output" | awk 'NR==3 {print $2}'`

        ;;

    "Java(TM)")

        JVM_VENDOR=Oracle

        # this will be "64-Bit" or "32-Bit"

        JVM_ARCH=`echo "$java_ver_output" | awk 'NR==3 {print $3}'`

        ;;

    *)

        # Help fill in other JVM values

        JVM_VENDOR=other

        JVM_ARCH=unknown

        ;;

esac

 

 

# Override these to set the amount of memory to allocate to the JVM at

# start-up. For production use you may wish to adjust this for your

# environment. MAX_HEAP_SIZE is the total amount of memory dedicated

# to the Java heap; HEAP_NEWSIZE refers to the size of the young

# generation. Both MAX_HEAP_SIZE and HEAP_NEWSIZE should be either set

# or not (if you set one, set the other).

#

# The main trade-off for the young generation is that the larger it

# is, the longer GC pause times will be. The shorter it is, the more

# expensive GC will be (usually).

#

# The example HEAP_NEWSIZE assumes a modern 8-core+ machine for decent pause

# times. If in doubt, and if you do not particularly want to tweak, go with

# 100 MB per physical CPU core.

 

#MAX_HEAP_SIZE="4G"

#HEAP_NEWSIZE="800M"

 

if [ "x$MAX_HEAP_SIZE" = "x" ] && [ "x$HEAP_NEWSIZE" = "x" ]; then

    calculate_heap_sizes

else

    if [ "x$MAX_HEAP_SIZE" = "x" ] ||  [ "x$HEAP_NEWSIZE" = "x" ]; then

        echo "please set or unset MAX_HEAP_SIZE and HEAP_NEWSIZE in pairs (see cassandra-env.sh)"

        exit 1

    fi

fi

 

# Specifies the default port over which Cassandra will be available for

# JMX connections.

JMX_PORT="7199"

 

 

# Here we create the arguments that will get passed to the jvm when

# starting cassandra.

 

# enable assertions.  disabling this in production will give a modest

# performance benefit (around 5%).

JVM_OPTS="$JVM_OPTS -ea"

 

# add the jamm javaagent

if [ "$JVM_VENDOR" != "OpenJDK" -o "$JVM_VERSION" \> "1.6.0" ] \

      || [ "$JVM_VERSION" = "1.6.0" -a "$JVM_PATCH_VERSION" -ge 23 ]

then

    JVM_OPTS="$JVM_OPTS -javaagent:$CASSANDRA_HOME/lib/jamm-0.2.5.jar"

fi

 

# enable thread priorities, primarily so we can give periodic tasks

# a lower priority to avoid interfering with client workload

JVM_OPTS="$JVM_OPTS -XX:+UseThreadPriorities"

# allows lowering thread priority without being root.  see

# http://tech.stolsvik.com/2010/01/linux-java-thread-priorities-workaround.html

JVM_OPTS="$JVM_OPTS -XX:ThreadPriorityPolicy=42"

 

# min and max heap sizes should be set to the same value to avoid

# stop-the-world GC pauses during resize, and so that we can lock the

# heap in memory on startup to prevent any of it from being swapped

# out.

JVM_OPTS="$JVM_OPTS -Xms${MAX_HEAP_SIZE}"

JVM_OPTS="$JVM_OPTS -Xmx${MAX_HEAP_SIZE}"

JVM_OPTS="$JVM_OPTS -Xmn${HEAP_NEWSIZE}"

JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"

 

# set jvm HeapDumpPath with CASSANDRA_HEAPDUMP_DIR

if [ "x$CASSANDRA_HEAPDUMP_DIR" != "x" ]; then

    JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=$CASSANDRA_HEAPDUMP_DIR/cassandra-`date +%s`-pid$$.hprof"

fi

 

 

startswith() { [ "${1#$2}" != "$1" ]; }

 

if [ "`uname`" = "Linux" ] ; then

    # reduce the per-thread stack size to minimize the impact of Thrift

    # thread-per-client.  (Best practice is for client connections to

    # be pooled anyway.) Only do so on Linux where it is known to be

    # supported.

    # u34 and greater need 180k

    #JVM_OPTS="$JVM_OPTS -Xss180k"

    JVM_OPTS="$JVM_OPTS -Xss256k"

fi

echo "xss = $JVM_OPTS"

 

# GC tuning options

JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"

JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"

JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"

JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"

JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"

JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"

# note: bash evals '1.7.x' as > '1.7' so this is really a >= 1.7 jvm check

if [ "$JVM_VERSION" \> "1.7" ] ; then

    JVM_OPTS="$JVM_OPTS -XX:+UseCondCardMark"

fi

 

# GC logging options -- uncomment to enable

# JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"

# JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"

# JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"

# JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"

# JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"

# JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"

# JVM_OPTS="$JVM_OPTS -XX:PrintFLSStatistics=1"

# JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc-`date +%s`.log"

# If you are using JDK 6u34 or later, or 7u2 or later, you can enable GC log rotation;

# don't stick the date in the log name if rotation is on.

# JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"

# JVM_OPTS="$JVM_OPTS -XX:+UseGCLogFileRotation"

# JVM_OPTS="$JVM_OPTS -XX:NumberOfGCLogFiles=10"

# JVM_OPTS="$JVM_OPTS -XX:GCLogFileSize=10M"

 

# uncomment to have Cassandra JVM listen for remote debuggers/profilers on port 1414

# JVM_OPTS="$JVM_OPTS -Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1414"

 

# Prefer binding to IPv4 network interfaces (when net.ipv6.bindv6only=1). See

# http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6342561 (short version:

# comment out this entry to enable IPv6 support).

JVM_OPTS="$JVM_OPTS -Djava.net.preferIPv4Stack=true"

 

# jmx: metrics and administration interface

#

# add this if you're having trouble connecting:

# JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=<public name>"

#JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=10.201.3.80"

#

# see

# https://blogs.oracle.com/jmxetc/entry/troubleshooting_connection_problems_in_jconsole

# for more on configuring JMX through firewalls, etc. (Short version:

# get it working with no firewall first.)

JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.port=$JMX_PORT"

JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.ssl=false"

JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=false"

JVM_OPTS="$JVM_OPTS $JVM_EXTRA_OPTS"
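
Since MAX_HEAP_SIZE and HEAP_NEWSIZE are left commented out above, the heap is sized by calculate_heap_sizes. The sketch below repeats the same arithmetic with memory and core count passed in as arguments, to show what the defaults would work out to. The function name heap_sizes_for and the three example machines are assumptions for illustration; they are not the actual specs of these nodes.

```shell
#!/bin/sh
# Same arithmetic as calculate_heap_sizes in cassandra-env.sh, but with
# memory (in MB) and CPU core count passed as arguments instead of being
# probed from the OS.
heap_sizes_for() {
    mem_mb=$1
    cores=$2

    # max(min(1/2 ram, 1024MB), min(1/4 ram, 8192MB))
    half=`expr $mem_mb / 2`
    quarter=`expr $half / 2`
    if [ "$half" -gt 1024 ]; then half=1024; fi
    if [ "$quarter" -gt 8192 ]; then quarter=8192; fi
    if [ "$half" -gt "$quarter" ]; then heap=$half; else heap=$quarter; fi

    # Young gen: min(100MB * cores, 1/4 * heap)
    yg_cap=`expr 100 "*" $cores`
    yg=`expr $heap / 4`
    if [ "$yg" -gt "$yg_cap" ]; then yg=$yg_cap; fi

    echo "MAX_HEAP_SIZE=${heap}M HEAP_NEWSIZE=${yg}M"
}

# Hypothetical machines: 2 GB / 2 cores, 8 GB / 4 cores, 64 GB / 16 cores.
heap_sizes_for 2048 2
heap_sizes_for 8192 4
heap_sizes_for 65536 16
```

On a small VM the 1024 MB floor dominates, so the heap may be much smaller than one might expect for a write-heavy stress test.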

 

Eric W. Marshall
emarshall@pulsepoint.com

646.421.6702 office
732.991.3856 cell