From: Apache Wiki
To: Apache Wiki
Reply-To: dev@cassandra.apache.org
Date: Sun, 25 Apr 2010 21:36:20 -0000
Message-ID: <20100425213620.25755.4572@eos.apache.org>
Subject: [Cassandra Wiki] Update of "StorageConfiguration_0.7" by ToddBlose

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "StorageConfiguration_0.7" page has been changed by ToddBlose.
The comment on this change is: Saving this here temporarily. Will copy over once we get closer to 0.7 release.

http://wiki.apache.org/cassandra/StorageConfiguration_0.7

--------------------------------------------------

New page:
Cassandra storage configuration is described by the ''conf/cassandra.yaml'' file. As the syntax evolves with releases, this wiki page tries to document those changes using ''[New in X.Y: ....]'' lines.

''[New in 0.7:'' The configuration file format has changed to YAML http://en.wikipedia.org/wiki/Yaml]

== AutoBootstrap ==
''[New in 0.5:''

Turn on to make new [non-seed] nodes automatically migrate the right data to themselves. (If no InitialToken is specified, they will pick one such that they will get half the range of the most-loaded node.) If a node starts up without bootstrapping, it will mark itself bootstrapped so that you can't subsequently accidentally bootstrap a node with data on it. (You can reset this by wiping your data and commitlog directories.)

Off by default so that new clusters and upgraders from 0.4 don't bootstrap immediately. You should turn this on when you start adding new nodes to a cluster that already has data on it. (If you are upgrading from 0.4, start your cluster with it off once before changing it to true. Otherwise, no data will be lost but you will incur a lot of unnecessary I/O before your cluster starts up.)

{{{
auto_bootstrap: false
}}}
'']''

== Cluster Name ==
The name of this cluster. This is mainly used to prevent machines in one logical cluster from joining another.

Example:
{{{
cluster_name: 'Test Cluster'
}}}

== Authenticator ==
''[New in 0.6:''

Allows for pluggable authentication of users, which defines whether it is necessary to call the Thrift 'login' method, and which parameters are required to log in. The default '!AllowAllAuthenticator' does not require users to call 'login': any user can perform any operation.
The other built-in option is '!SimpleAuthenticator', which requires users and passwords to be defined in property files, and for users to call login with a valid combination.

Example:
{{{
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
}}}
'']''

=== EndPointSnitch ===
!EndPointSnitch: Set this to the class that implements {{{IEndPointSnitch}}}, which determines whether two endpoints are in the same data center or on the same rack. Out of the box, Cassandra provides {{{org.apache.cassandra.locator.EndPointSnitch}}}

{{{
endpoint_snitch: org.apache.cassandra.locator.SimpleSnitch
}}}

Note: this class will work on hosts' IPs only. There is no configuration parameter to tell Cassandra that a node is in rack ''R'' and in datacenter ''D''. The current rules are based on the two methods (see [[http://svn.apache.org/viewvc/incubator/cassandra/trunk/src/java/org/apache/cassandra/locator/EndPointSnitch.java?view=markup|EndPointSnitch.java]]):

 * isOnSameRack: Look at the IP address of the two hosts. Compare the 3rd octet. If they are the same, the hosts are in the same rack; otherwise they are in different racks.
 * isInSameDataCenter: Look at the IP address of the two hosts. Compare the 2nd octet. If they are the same, the hosts are in the same datacenter; otherwise they are in different datacenters.

== Keyspaces and ColumnFamilies ==
Keyspaces and {{{ColumnFamilies}}}: A {{{ColumnFamily}}} is the Cassandra concept closest to a relational table. {{{Keyspaces}}} are separate groups of {{{ColumnFamilies}}}. Except in very unusual circumstances you will have one Keyspace per application.

There is an implicit keyspace named 'system' for Cassandra internals.

{{{
keyspaces:
    - name: Keyspace1
}}}

''[New in 0.5:''

The fraction of keys per sstable whose locations we keep in memory in "mostly LRU" order. (JUST the key locations, NOT any column values.)
The amount of memory used by the default setting of 0.01 is comparable to the amount used by the internal per-sstable key index. Consider increasing this if you have fewer, wider rows. Set to 0 to disable entirely.

{{{
0.01
}}}
'']''

''[New in 0.6: !EndPointSnitch, !ReplicaPlacementStrategy and !ReplicationFactor became configurable per keyspace. Prior to that they were global settings.]''

=== ReplicaPlacementStrategy and ReplicationFactor ===
Strategy: Setting this to the class that implements {{{IReplicaPlacementStrategy}}} will change the way the node picker works. Out of the box, Cassandra provides {{{org.apache.cassandra.locator.RackUnawareStrategy}}} and {{{org.apache.cassandra.locator.RackAwareStrategy}}} (place one replica in a different datacenter, and the others on different racks in the same one).

{{{
replica_placement_strategy: org.apache.cassandra.locator.RackUnawareStrategy
}}}

Number of replicas of the data
{{{
replication_factor: 1
}}}

=== ColumnFamilies ===
The {{{CompareWith}}} attribute tells Cassandra how to sort the columns for slicing operations. The default is {{{BytesType}}}, which is a straightforward lexical comparison of the bytes in each column. Other options are {{{AsciiType}}}, {{{UTF8Type}}}, {{{LexicalUUIDType}}}, {{{TimeUUIDType}}}, and {{{LongType}}}. You can also specify the fully-qualified class name to a class of your choice extending {{{org.apache.cassandra.db.marshal.AbstractType}}}.

 * {{{SuperColumns}}} have a similar {{{CompareSubcolumnsWith}}} attribute.
 * {{{BytesType}}}: Simple sort by byte value. No validation is performed.
 * {{{AsciiType}}}: Like {{{BytesType}}}, but validates that the input can be parsed as US-ASCII.
 * {{{UTF8Type}}}: A string encoded as UTF8
 * {{{LongType}}}: A 64bit long
 * {{{LexicalUUIDType}}}: A 128bit UUID, compared lexically (by byte value)
 * {{{TimeUUIDType}}}: A 128bit version 1 UUID, compared by timestamp

(To get the closest approximation to 0.3-style {{{supercolumns}}}, you would use {{{CompareWith=UTF8Type CompareSubcolumnsWith=LongType}}}.)

If {{{FlushPeriodInMinutes}}} is configured and positive, the column family will be flushed to disk with that period whether it is dirty or not. This is intended for lightly-used {{{columnfamilies}}} so that they do not prevent commitlog segments from being purged.

''[New in 0.5:'' An optional `Comment` attribute may be used to attach additional human-readable information about the column family to its definition. '']''

{{{
column_families:
    - name: Standard1
      compare_with: BytesType

    - name: Standard2
      compare_with: UTF8Type
      read_repair_chance: 0.1
      keys_cached: 100

    - name: StandardByUUID1
      compare_with: TimeUUIDType

    - name: Super1
      column_type: Super
      compare_with: BytesType
      compare_subcolumns_with: BytesType

    - name: Super2
      column_type: Super
      compare_subcolumns_with: UTF8Type
      preloadRowCache: true
      rows_cached: 10000
      keys_cached: 50
      comment: 'A column family with supercolumns, whose column and subcolumn names are UTF8 strings'
}}}

== Partitioner ==
Partitioner: any {{{IPartitioner}}} may be used, including your own as long as it is on the classpath. Out of the box, Cassandra provides {{{org.apache.cassandra.dht.RandomPartitioner}}}, {{{org.apache.cassandra.dht.OrderPreservingPartitioner}}}, and {{{org.apache.cassandra.dht.CollatingOrderPreservingPartitioner}}}. (CollatingOPP collates according to EN,US rules, not naive byte ordering. Use this as an example if you need locale-aware collation.) Range queries require using an order-preserving partitioner.

Achtung! Changing this parameter requires wiping your data directories, since the partitioner can modify the !sstable on-disk format.
Example:
{{{
partitioner: org.apache.cassandra.dht.RandomPartitioner
}}}

If you are using an order-preserving partitioner and you know your key distribution, you can specify the token for this node to use. (Keys are sent to the node with the "closest" token, so distributing your tokens equally along the key distribution space will spread keys evenly across your cluster.) This setting is only checked the first time a node is started.

This can also be useful with {{{RandomPartitioner}}} to force equal spacing of tokens around the hash space, especially for clusters with a small number of nodes.

{{{
initial_token:
}}}

Cassandra uses an MD5 hash internally to place keys on the ring with {{{RandomPartitioner}}}. So it makes sense to divide the hash space equally by the number of machines available using {{{InitialToken}}} (i.e., if there are 10 machines, each will handle 1/10th of the maximum hash value) and expect that the machines will get a reasonably equal load.

With {{{OrderPreservingPartitioner}}} the keys themselves are used to determine placement on the ring. One potential drawback of this approach is that if rows are inserted with sequential keys, all the write load will go to the same node.

== Directories ==
Directories: Specify where Cassandra should store different data on disk. Keep the data disks and the {{{CommitLog}}} disks separate for best performance. See also [[FAQ#what_kind_of_hardware_should_i_use|what kind of hardware should I use?]]

{{{
commitlog_directory: /var/lib/cassandra/commitlog
data_file_directories:
    - /var/lib/cassandra/data
}}}

== Seeds ==
Addresses of hosts that are deemed contact points. Cassandra nodes use this list of hosts to find each other and learn the topology of the ring. You must change this if you are running multiple nodes!

{{{
seeds:
    - 127.0.0.1
}}}

Never use a node's own address as a seed if you are bootstrapping it by setting AutoBootstrap to true.
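The equal division of the {{{RandomPartitioner}}} hash space described in the Partitioner section can be sketched in a few lines of Python. This is only an illustration, not part of Cassandra: the helper name `evenly_spaced_tokens` is hypothetical, and the ring size of 2**127 is an assumption about the token range that you should verify against your release before pasting values into `initial_token`.

```python
# Sketch: compute evenly spaced initial_token values for RandomPartitioner.
# ASSUMPTION: the token space spans 0 .. 2**127 (not stated on this page).
RING_SIZE = 2 ** 127


def evenly_spaced_tokens(node_count):
    """Return one token per node, dividing the ring into equal slices."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Integer arithmetic keeps the large token values exact.
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    # Print one initial_token line per node of a hypothetical 10-node cluster.
    for index, token in enumerate(evenly_spaced_tokens(10)):
        print("node %d: initial_token: %d" % (index, token))
```

Each node would receive a different value from the list, so every machine covers roughly 1/N of the hash space.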
== Miscellaneous ==
Time to wait for a reply from other nodes before failing the command
{{{
rpc_timeout_in_ms: 5000
}}}

Size to allow commitlog to grow to before creating a new segment
{{{
commitlog_rotation_threshold_in_mb: 128
}}}

Local hosts and ports

Address to bind to and tell other nodes to connect to. You _must_ change this if you want multiple nodes to be able to communicate!

Leaving it blank leaves it up to {{{InetAddress.getLocalHost()}}}. This will always do the Right Thing *if* the node is properly configured (hostname, name resolution, etc), and the Right Thing is to use the address associated with the hostname (it might not be). The ControlPort setting is deprecated in 0.6 and can be safely removed from configuration.

{{{
listen_address: localhost
storage_port: 7000
}}}

The address to bind the Thrift RPC service to. Unlike {{{ListenAddress}}} above, you *can* specify {{{0.0.0.0}}} here if you want Thrift to listen on all interfaces.

Leaving this blank has the same effect it does for {{{ListenAddress}}} (i.e. it will be based on the configured hostname of the node).

{{{
rpc_address: localhost
rpc_port: 9160
}}}

Whether or not to use a framed transport for Thrift. If this option is set to true then you must also use a framed transport on the client side (framed and non-framed transports are not compatible).
{{{
thrift_framed_transport: false
}}}

== Memory, Disk, and Performance ==
Buffer size to use when performing contiguous column slices. Increase this to the size of the column slices you typically perform. (Name-based queries are performed with a buffer size of !ColumnIndexSizeInKB.)
{{{
sliced_buffer_size_in_kb: 64
}}}

Buffer size to use when flushing !memtables to disk. (Only one !memtable is ever flushed at a time.) Increase (decrease) the index buffer size relative to the data buffer if you have few (many) columns per key. Bigger is only better _if_ your !memtables get large enough to use the space.
(Check in your data directory after your app has been running long enough.)
{{{
flush_data_buffer_size_in_mb: 32
flush_index_buffer_size_in_mb: 8
}}}

Add column indexes to a row after its contents reach this size. Increase if your column values are large, or if you have a very large number of columns. There are two competing concerns: Cassandra has to deserialize this much of the row to read a single column, so you want it to be small, at least if you do many partial-row reads; but all the index data is read for each access, so you don't want to generate that wastefully either.
{{{
column_index_size_in_kb: 64
}}}

The maximum amount of data to store in memory per !ColumnFamily before flushing to disk. Note: There is one memtable per column family, and this threshold is based solely on the amount of data stored, not actual heap memory usage (there is some overhead in indexing the columns). See also MemtableThresholds.
{{{
memtable_throughput_in_mb: 64
}}}

The maximum number of columns in millions to store in memory per ColumnFamily before flushing to disk. This is also a per-memtable setting. Use with {{{MemtableSizeInMB}}} to tune memory usage.
{{{
memtable_operations_in_millions: 0.3
}}}

''[New in 0.5:''

The maximum time to leave a dirty memtable unflushed. (While any affected columnfamilies have unflushed data from a commit log segment, that segment cannot be deleted.) This needs to be large enough that it won't cause a flush storm of all your memtables flushing at once because none has hit the size or count thresholds yet. For production, a larger value such as 1440 is recommended.
{{{
memtable_flush_after_mins: 60
}}}
'']''

Unlike most systems, in Cassandra writes are faster than reads, so you can afford more of those in parallel. A good rule of thumb is 2 concurrent reads per processor core. Increase {{{ConcurrentWrites}}} to the number of clients writing at once if you enable {{{CommitLogSync + CommitLogSyncDelay}}}.
{{{
concurrent_reads: 8
concurrent_writes: 32
}}}

{{{CommitLogSync}}} may be either "periodic" or "batch." When in batch mode, Cassandra won't ack writes until the commit log has been fsynced to disk. It will wait up to {{{CommitLogSyncBatchWindowInMS}}} milliseconds for other writes before performing the sync.

This is less necessary in Cassandra than in traditional databases since replication reduces the odds of losing data from a failure after writing the log entry but before it actually reaches the disk. So the other option is "periodic," where writes may be acked immediately and the {{{CommitLog}}} is simply synced every {{{CommitLogSyncPeriodInMS}}} milliseconds.
{{{
commitlog_sync: periodic
}}}

Interval at which to perform syncs of the {{{CommitLog}}} in periodic mode. Usually the default of 1000ms is fine; increase it only if the CommitLog PendingTasks backlog in JMX shows that you are frequently scheduling a second sync while the first has not yet been processed.
{{{
commitlog_sync_period_in_ms: 1000
}}}

Delay (in milliseconds) during which additional commit log entries may be written before fsync in batch mode. This will increase latency slightly, but can vastly improve throughput where there are many writers. Set to zero to disable (each entry will be synced individually). Reasonable values range from a minimal 0.1 to 10 or even more if throughput matters more than latency.
{{{
# commitlog_sync_batch_window_in_ms: 1
}}}

Time to wait before garbage-collecting deletion markers. Set this to a large enough value that you are confident that the deletion marker will be propagated to all replicas by the time this many seconds has elapsed, even in the face of hardware failures. The default value is ten days.
{{{
gc_grace_seconds: 864000
}}}

The threshold size in megabytes the binary memtable must grow to before it is submitted for flushing to disk.
{{{
binary_memtable_throughput_in_mb: 256
}}}
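Two of the numbers in the Memory, Disk, and Performance section follow from simple arithmetic: the rule of thumb of 2 concurrent reads per processor core, and the ten-day default for `gc_grace_seconds`. The Python sketch below only reproduces that arithmetic; the helper names are hypothetical and nothing here is part of Cassandra itself.

```python
# Sketch: arithmetic behind two settings on this page.
# The function names are illustrative assumptions, not Cassandra APIs.
SECONDS_PER_DAY = 24 * 60 * 60


def suggested_concurrent_reads(cpu_cores):
    """Rule of thumb from this page: 2 concurrent reads per processor core."""
    return 2 * cpu_cores


def gc_grace_seconds_for_days(days):
    """Convert a grace period in days into a gc_grace_seconds value."""
    return days * SECONDS_PER_DAY


if __name__ == "__main__":
    # A 4-core machine and a ten-day grace period, matching the examples above.
    print("concurrent_reads: %d" % suggested_concurrent_reads(4))
    print("gc_grace_seconds: %d" % gc_grace_seconds_for_days(10))
```

A 4-core machine yields the `concurrent_reads: 8` shown in the example configuration, and ten days yields `gc_grace_seconds: 864000`.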