Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
From: Apache Wiki <wikidiffs@apache.org>
To: Apache Wiki <wikidiffs@apache.org>
Date: Sun, 25 Apr 2010 21:36:20 -0000
Message-ID: <20100425213620.25755.4572@eos.apache.org>
Subject: 
 =?utf-8?q?=5BCassandra_Wiki=5D_Update_of_=22StorageConfiguration=5F0=2E7?=
 =?utf-8?q?=22_by_ToddBlose?=

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for=
 change notification.

The "StorageConfiguration_0.7" page has been changed by ToddBlose.
The comment on this change is: Saving this here temporarily. Will copy over=
 once we get closer to 0.7 release..
http://wiki.apache.org/cassandra/StorageConfiguration_0.7

--------------------------------------------------

New page:
Cassandra storage configuration is described by the ''conf/cassandra.yaml''=
 file. As the syntax evolves with releases, this wiki page tries to documen=
t those changes using ''[New in X.Y: ....]'' lines.

''[New in 0.7:'' The configuration file format has changed to YAML =
(http://en.wikipedia.org/wiki/Yaml). '']''

=3D=3D AutoBootstrap =3D=3D
''[New in 0.5:''

Turn this on to make new [non-seed] nodes automatically migrate the =
right data to themselves. (If no InitialToken is specified, they will =
pick one such that they get half the range of the most-loaded node.) If =
a node starts up without bootstrapping, it will mark itself =
bootstrapped so that you cannot later accidentally bootstrap a node =
with data on it. (You can reset this by wiping your data and commitlog =
directories.)

Off by default so that new clusters and upgraders from 0.4 don't bootstrap =
immediately.  You should turn this on when you start adding new nodes to a =
cluster that already has data on it.  (If you are upgrading from 0.4, start=
 your cluster with it off once before changing it to true. Otherwise, no da=
ta will be lost but you will incur a lot of unnecessary I/O before your clu=
ster starts up.)

{{{
  auto_bootstrap: false
}}}
'']''
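
As an illustration only, a new non-seed node joining a cluster that =
already holds data might carry settings like the following. Leaving =
{{{initial_token}}} blank is an assumption made here so that the node =
picks its own token as described above; this is a sketch, not a =
recommended production configuration.

{{{
# Sketch: new, non-seed node joining a cluster that already has data.
auto_bootstrap: true
# Left blank (assumption) so that the node picks a token itself.
initial_token:
}}}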

=3D=3D Cluster Name =3D=3D
The name of this cluster.  This is mainly used to prevent machines in one l=
ogical cluster from joining another.

Example:

{{{
cluster_name: 'Test Cluster'
}}}
=3D=3D Authenticator =3D=3D
''[New in 0.6:''

Allows for pluggable authentication of users, which defines whether it =
is necessary to call the Thrift 'login' method, and which parameters =
are required to log in. The default '!AllowAllAuthenticator' does not =
require users to call 'login': any user can perform any operation. The =
other built-in option is '!SimpleAuthenticator', which requires users =
and passwords to be defined in property files, and users to call =
'login' with a valid user name and password.

Example:

{{{
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
}}}
'']''

=3D=3D EndPointSnitch =3D=3D
!EndPointSnitch: Set this to the class that implements =
{{{IEndPointSnitch}}}, which determines whether two endpoints are in =
the same data center or on the same rack. Out of the box, Cassandra =
provides {{{org.apache.cassandra.locator.EndPointSnitch}}}.

{{{
endpoint_snitch: org.apache.cassandra.locator.SimpleSnitch
}}}
Note: this class works on hosts' IP addresses only. There is no =
configuration parameter to tell Cassandra that a node is in rack ''R'' =
and in datacenter ''D''. The current rules are based on two methods =
(see [[http://svn.apache.org/viewvc/incubator/cassandra/trunk/src/java/=
org/apache/cassandra/locator/EndPointSnitch.java?view=3Dmarkup|=
EndPointSnitch.java]]):

 * isOnSameRack: Look at the IP addresses of the two hosts and compare =
the 3rd octet. If they are the same, the hosts are in the same rack; =
otherwise they are in different racks.

 * isInSameDataCenter: Look at the IP addresses of the two hosts and =
compare the 2nd octet. If they are the same, the hosts are in the same =
datacenter; otherwise they are in different datacenters.
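
The two rules above can be traced by hand. The addresses below are =
hypothetical and appear only as comments; the sketch assumes nothing =
beyond the octet comparison just described.

{{{
# Hypothetical hosts, compared with the two rules above:
#   10.20.30.1 and 10.20.30.2   same 2nd and 3rd octets:
#                               same datacenter, same rack
#   10.20.30.1 and 10.20.40.1   same 2nd octet, different 3rd octet:
#                               same datacenter, different racks
#   10.20.30.1 and 10.50.60.1   different 2nd octet:
#                               different datacenters
endpoint_snitch: org.apache.cassandra.locator.SimpleSnitch
}}}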

=3D=3D Keyspaces and ColumnFamilies =3D=3D
Keyspaces and {{{ColumnFamilies}}}: A {{{ColumnFamily}}} is the Cassandra c=
oncept closest to a relational table.  {{{Keyspaces}}} are separate groups =
of {{{ColumnFamilies}}}.  Except in very unusual circumstances you will hav=
e one Keyspace per application.

There is an implicit keyspace named 'system' for Cassandra internals.

{{{
keyspaces:
    - name: Keyspace1
}}}
''[New in 0.5:''

The fraction of keys per sstable whose locations we keep in memory in "most=
ly LRU" order.  (JUST the key locations, NOT any column values.) The amount=
 of memory used by the default setting of 0.01 is comparable to the amount =
used by the internal per-sstable key index. Consider increasing this if you=
 have fewer, wider rows. Set to 0 to disable entirely.

{{{
      <KeysCachedFraction>0.01</KeysCachedFraction>
}}}
'']''

''[New in 0.6: !EndPointSnitch, !ReplicaPlacementStrategy and !ReplicationF=
actor became configurable per keyspace.  Prior to that they were global set=
tings.]''

=3D=3D=3D ReplicaPlacementStrategy and ReplicationFactor =3D=3D=3D
Strategy: Set this to the class that implements =
{{{IReplicaPlacementStrategy}}} to change how nodes are picked to hold =
replicas. Out of the box, Cassandra provides =
{{{org.apache.cassandra.locator.RackUnawareStrategy}}} and =
{{{org.apache.cassandra.locator.RackAwareStrategy}}} (which places one =
replica in a different datacenter and the others on different racks in =
the same datacenter).

{{{
replica_placement_strategy: org.apache.cassandra.locator.RackUnawareStrategy
}}}
Number of replicas of the data

{{{
replication_factor: 1
}}}
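Putting the pieces together, a keyspace definition might look like the =
sketch below. It assumes that the placement strategy and replication =
factor nest under the keyspace entry, which the per-keyspace note above =
and the indentation of the column family examples on this page suggest; =
check the {{{conf/cassandra.yaml}}} shipped with your release before =
relying on the exact layout.

{{{
keyspaces:
    - name: Keyspace1
      # Assumed to be per-keyspace settings (see note above).
      replica_placement_strategy: org.apache.cassandra.locator.RackUnawa=
reStrategy
      replication_factor: 1
      column_families:
        - name: Standard1
          compare_with: BytesType
}}}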
=3D=3D=3D ColumnFamilies =3D=3D=3D
The {{{CompareWith}}} attribute tells Cassandra how to sort the columns =
for slicing operations.  The default is {{{BytesType}}}, which is a =
straightforward lexical comparison of the bytes in each column. Other =
options are {{{AsciiType}}}, {{{UTF8Type}}}, {{{LexicalUUIDType}}}, =
{{{TimeUUIDType}}}, and {{{LongType}}}.  You can also specify the =
fully-qualified class name of a class of your choice extending =
{{{org.apache.cassandra.db.marshal.AbstractType}}}.

{{{SuperColumns}}} have a similar {{{CompareSubcolumnsWith}}} attribute.

 * {{{BytesType}}}: Simple sort by byte value.  No validation is performed.
 * {{{AsciiType}}}: Like {{{BytesType}}}, but validates that the input =
can be parsed as US-ASCII.
 * {{{UTF8Type}}}: A string encoded as UTF-8.
 * {{{LongType}}}: A 64-bit long.
 * {{{LexicalUUIDType}}}: A 128-bit UUID, compared lexically (by byte =
value).
 * {{{TimeUUIDType}}}: A 128-bit version 1 UUID, compared by timestamp.

(To get the closest approximation to 0.3-style {{{supercolumns}}}, you woul=
d use {{{CompareWith=3DUTF8Type CompareSubcolumnsWith=3DLongType}}}.)

If {{{FlushPeriodInMinutes}}} is configured and positive, the column =
family's memtable is flushed to disk at that interval whether it is =
dirty or not.  This is intended for lightly-used column families so =
that they do not prevent commitlog segments from being purged.

''[New in 0.5:'' An optional `Comment` attribute may be used to attach addi=
tional human-readable information about the column family to its definition=
. '']''

{{{
      column_families:
        - name: Standard1
          compare_with: BytesType

        - name: Standard2
          compare_with: UTF8Type
          read_repair_chance: 0.1
          keys_cached: 100

        - name: StandardByUUID1
          compare_with: TimeUUIDType

        - name: Super1
          column_type: Super
          compare_with: BytesType
          compare_subcolumns_with: BytesType

        - name: Super2
          column_type: Super
          compare_subcolumns_with: UTF8Type
          preload_row_cache: true
          rows_cached: 10000
          keys_cached: 50
          comment: 'A column family with supercolumns, whose column and sub=
column names are UTF8 strings'
}}}
=3D=3D Partitioner =3D=3D
Partitioner: any {{{IPartitioner}}} may be used, including your own, as =
long as it is on the classpath.  Out of the box, Cassandra provides =
{{{org.apache.cassandra.dht.RandomPartitioner}}}, =
{{{org.apache.cassandra.dht.OrderPreservingPartitioner}}}, and =
{{{org.apache.cassandra.dht.CollatingOrderPreservingPartitioner}}}. =
(CollatingOPP collates according to EN,US rules, not naive byte =
ordering.  Use this as an example if you need locale-aware collation.) =
Range queries require using an order-preserving partitioner.

Warning!  Changing this parameter requires wiping your data =
directories, since the partitioner can modify the !sstable on-disk =
format.

Example:

{{{
partitioner: org.apache.cassandra.dht.RandomPartitioner
}}}
If you are using an order-preserving partitioner and you know your key dist=
ribution, you can specify the token for this node to use. (Keys are sent to=
 the node with the "closest" token, so distributing your tokens equally alo=
ng the key distribution space will spread keys evenly across your cluster.)=
  This setting is only checked the first time a node is started.

This can also be useful with {{{RandomPartitioner}}} to force equal spacing=
 of tokens around the hash space, especially for clusters with a small numb=
er of nodes.

{{{
initial_token:
}}}
Cassandra uses an MD5 hash internally to place keys on the ring with =
{{{RandomPartitioner}}}. So it makes sense to divide the hash space =
equally among the available machines using {{{InitialToken}}} (i.e., =
if there are 10 machines, each will handle 1/10th of the maximum hash =
value) and expect that the machines will get a reasonably equal load.
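
As a worked example, the sketch below spaces tokens evenly for a =
hypothetical four-node cluster. It assumes that the =
{{{RandomPartitioner}}} token space runs from 0 to 2**127, which this =
page does not state, so verify it for your release; token i is then =
i * 2**127 / 4. Each node's file carries exactly one of these lines.

{{{
# Node 1:
initial_token: 0
# Node 2:
# initial_token: 42535295865117307932921825928971026432
# Node 3:
# initial_token: 85070591730234615865843651857942052864
# Node 4:
# initial_token: 127605887595351923798765477786913079296
}}}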

With {{{OrderPreservingPartitioner}}} the keys themselves are used to =
place rows on the ring. One potential drawback of this approach is that =
if rows are inserted with sequential keys, all the write load will go =
to the same node.

=3D=3D Directories =3D=3D
Directories: Specify where Cassandra should store different data on disk.  =
Keep the data disks and the {{{CommitLog}}} disks separate for best perform=
ance. See also [[FAQ#what_kind_of_hardware_should_i_use|what kind of hardwa=
re should I use?]]

{{{
commitlog_directory: /var/lib/cassandra/commitlog
data_file_directories:
    - /var/lib/cassandra/data
}}}
=3D=3D Seeds =3D=3D
Addresses of hosts that are deemed contact points. Cassandra nodes use this=
 list of hosts to find each other and learn the topology of the ring. You m=
ust change this if you are running multiple nodes!

{{{
seeds:
    - 127.0.0.1
}}}
Never use a node's own address as a seed if you are bootstrapping it by set=
ting AutoBootstrap to true.
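
For instance, a two-seed cluster could be sketched as follows. The =
addresses are hypothetical placeholders; substitute the addresses of =
your own seed nodes, and leave a bootstrapping node's own address out =
of its list, as noted above.

{{{
# Hypothetical seed addresses.
seeds:
    - 192.168.1.10
    - 192.168.1.11
}}}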

=3D=3D Miscellaneous =3D=3D
Time to wait for a reply from other nodes before failing the command

{{{
rpc_timeout_in_ms: 5000
}}}
Size to allow commitlog to grow to before creating a new segment

{{{
commitlog_rotation_threshold_in_mb: 128
}}}
Local hosts and ports

Address to bind to and tell other nodes to connect to.  You _must_ change t=
his if you want multiple nodes to be able to communicate!

Leaving it blank leaves it up to {{{InetAddress.getLocalHost()}}}. This wil=
l always do the Right Thing *if* the node is properly configured (hostname,=
 name resolution, etc), and the Right Thing is to use the address associate=
d with the hostname (it might not be).  The ControlPort setting is deprecat=
ed in 0.6 and can be safely removed from configuration.

{{{
listen_address: localhost
# TCP port, for commands and data
storage_port: 7000
}}}
The address to bind the Thrift RPC service to. Unlike {{{ListenAddress}}} a=
bove, you *can* specify {{{0.0.0.0}}} here if you want Thrift to listen on =
all interfaces.

Leaving this blank has the same effect it does for {{{ListenAddress}}}, (i.=
e. it will be based on the configured hostname of the node).

{{{
rpc_address: localhost
# Thrift RPC port (the port clients connect to).
rpc_port: 9160
}}}
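As a sketch, a node in a multi-node cluster might bind the two services =
as shown below. The address 192.168.1.10 is a hypothetical placeholder =
for the node's own address; {{{0.0.0.0}}} is used only for =
{{{rpc_address}}}, as described above.

{{{
# Hypothetical address that other nodes use to reach this node.
listen_address: 192.168.1.10
storage_port: 7000
# Thrift listens on all interfaces.
rpc_address: 0.0.0.0
rpc_port: 9160
}}}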
Whether or not to use a framed transport for Thrift. If this option is =
set to true then you must also use a framed transport on the client =
side (framed and non-framed transports are not compatible).

{{{
thrift_framed_transport: false
}}}
=3D=3D Memory, Disk, and Performance =3D=3D
Buffer size to use when performing contiguous column slices. Increase this =
to the size of the column slices you typically perform.  (Name-based querie=
s are performed with a buffer size of  !ColumnIndexSizeInKB.)

{{{
sliced_buffer_size_in_kb: 64
}}}
Buffer size to use when flushing !memtables to disk. (Only one  !memtable i=
s ever flushed at a time.) Increase (decrease) the index buffer size relati=
ve to the data buffer if you have few (many)  columns per key.  Bigger is o=
nly better _if_ your !memtables get large enough to use the space. (Check i=
n your data directory after your app has been running long enough.)

{{{
flush_data_buffer_size_in_mb: 32
flush_index_buffer_size_in_mb: 8
}}}
Add column indexes to a row after its contents reach this size. =
Increase if your column values are large, or if you have a very large =
number of columns.  There are two competing concerns: Cassandra has to =
deserialize this much of the row to read a single column, so you want =
it to be small, at least if you do many partial-row reads; but all the =
index data is read for each access, so you don't want to generate it =
wastefully either.

{{{
column_index_size_in_kb: 64
}}}
The maximum amount of data to store in memory per !ColumnFamily before flus=
hing to disk.  Note: There is one memtable per column family, and  this thr=
eshold is based solely on the amount of data stored, not actual heap memory=
 usage (there is some overhead in indexing the columns). See also MemtableT=
hresholds.

{{{
memtable_throughput_in_mb: 64
}}}
The maximum number of columns, in millions, to store in memory per =
!ColumnFamily before flushing to disk.  This is also a per-memtable =
setting.  Use with {{{memtable_throughput_in_mb}}} to tune memory usage.

{{{
memtable_operations_in_millions: 0.3
}}}
''[New in 0.5:''

The maximum time to leave a dirty memtable unflushed. (While any affected c=
olumnfamilies have unflushed data from a commit log segment, that segment c=
annot be deleted.) This needs to be large enough that it won't cause a flus=
h storm of all your memtables flushing at once because none has hit the siz=
e or count thresholds yet.  For production, a larger value such as 1440 is =
recommended.

{{{
memtable_flush_after_mins: 60
}}}
'']''

Unlike most systems, in Cassandra writes are faster than reads, so you can =
afford more of those in parallel.  A good rule of thumb is 2 concurrent rea=
ds per processor core.  Increase {{{ConcurrentWrites}}} to the number of cl=
ients writing at once if you enable {{{CommitLogSync + CommitLogSyncDelay}}=
}.

{{{
concurrent_reads: 8
concurrent_writes: 32
}}}
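Applying the rule of thumb above to a hypothetical eight-core machine =
gives 2 x 8, or 16, concurrent reads. The core count is an assumption =
for illustration, and the write setting is simply the value shown above.

{{{
# Hypothetical 8-core node: 2 reads per core.
concurrent_reads: 16
concurrent_writes: 32
}}}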
{{{CommitLogSync}}} may be either "periodic" or "batch."  When in batch mod=
e, Cassandra won't ack writes until the commit log has been fsynced to disk=
.  It will wait up to {{{CommitLogSyncBatchWindowInMS}}} milliseconds for o=
ther writes, before performing the sync.

This is less necessary in Cassandra than in traditional databases since =
replication reduces the odds of losing data from a failure after =
writing the log entry but before it actually reaches the disk. So the =
other option is "periodic," where writes may be acked immediately and =
the {{{CommitLog}}} is simply synced every =
{{{CommitLogSyncPeriodInMS}}} milliseconds.

{{{
commitlog_sync: periodic
}}}
Interval at which to perform syncs of the {{{CommitLog}}} in periodic mode.=
 Usually the default of 1000ms is fine; increase it only if the CommitLog P=
endingTasks backlog in jmx shows that you are frequently scheduling a secon=
d sync while the first has not yet been processed.

{{{
commitlog_sync_period_in_ms: 1000
}}}
Delay (in milliseconds) during which additional commit log entries may be w=
ritten before fsync in batch mode.  This will increase latency slightly, bu=
t can vastly improve throughput where there are many writers.  Set to zero =
to disable (each entry will be synced individually).  Reasonable values ran=
ge from a minimal 0.1 to 10 or even more if throughput matters more than la=
tency.

{{{
# commitlog_sync_batch_window_in_ms: 1
}}}
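For comparison with the periodic example above, a batch-mode setup =
could be sketched as follows. Both keys appear on this page; the =
1 ms window is just the commented-out value shown above, not a tuned =
recommendation.

{{{
# Sketch: ack writes only after the commit log has been fsynced.
commitlog_sync: batch
# Wait up to 1 ms for other writes before each sync.
commitlog_sync_batch_window_in_ms: 1
}}}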
Time to wait before garbage-collecting deletion markers.  Set this to a =
large enough value that you are confident that the deletion marker will =
be propagated to all replicas by the time this many seconds have =
elapsed, even in the face of hardware failures.  The default value is =
ten days.

{{{
gc_grace_seconds: 864000
}}}
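The default follows from simple arithmetic: 10 days x 24 hours x 60 =
minutes x 60 seconds is 864000 seconds. As a further illustration, the =
three-day value below is a hypothetical choice, not a recommendation.

{{{
# 10 days: 10 * 24 * 60 * 60 is 864000 seconds (the default).
# 3 days:   3 * 24 * 60 * 60 is 259200 seconds (hypothetical).
gc_grace_seconds: 259200
}}}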
The threshold size in megabytes the binary memtable must grow to, before it=
's submitted for flushing to disk.

{{{
binary_memtable_throughput_in_mb: 256
}}}