cassandra-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "DataModel" by EricEvans
Date Wed, 06 May 2009 02:47:44 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The following page has been changed by EricEvans:
http://wiki.apache.org/cassandra/DataModel

The comment on the change is:
imported from confluence wiki

New page:
= Introduction =

The basic unit of access control within Cassandra is a Column Family. A table in Cassandra
is made up of one or more column families. A row in a table is uniquely identified by its
key. The key is a string and can be of any size. The number of column families and the name
of each column family must currently be fixed at the time the cluster is started. There is
no limit on the number of column families, but it is expected that there will be relatively
few of them.

A column family can be one of two types: Simple or Super. Columns within both types are created
dynamically, and there is no limit on their number. Columns are constructs that are identified
by a name and carry a value and a user-defined timestamp. The number of columns contained in
a column family can be very large, and it can vary per key. For instance, key K1 could have
1024 columns/supercolumns while key K2 could have 64 columns/supercolumns.

Supercolumns are constructs that have a name and an unbounded number of columns associated
with them. The number of supercolumns associated with any column family may be very large.
They exhibit the same characteristics as columns. Columns can be sorted by name or by time,
and this is set explicitly in the configuration file for any given column family.

The main limitation on column and supercolumn size is that all data for a single key must
fit on a single machine in the cluster.  Because keys alone are used to determine the nodes
responsible for replicating their data, the amount of data associated with a single key has
this upper bound.
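
As a rough illustration of the structure described above, the following Python sketch models
a table as nested dictionaries. It is not Cassandra code; the names `Column`, `table`, `SimpleCF`,
`SuperCF` and `column_count` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """A column as described above: a name, a value and a user-defined timestamp."""
    name: str
    value: str
    timestamp: int


# Hypothetical table with one Simple and one Super column family.
# Layout: row key -> column family name -> contents.
#   Simple family: column name -> Column
#   Super family:  supercolumn name -> {column name -> Column}
table = {
    "K1": {
        "SimpleCF": {
            "C1": Column("C1", "v1", 1),
            "C2": Column("C2", "v2", 2),
        },
        "SuperCF": {
            "SC": {"C": Column("C", "v3", 3)},
        },
    },
    "K2": {
        # The set of columns may differ from one key to the next.
        "SimpleCF": {"C1": Column("C1", "v4", 4)},
        "SuperCF": {},
    },
}


def column_count(row_key: str, family: str) -> int:
    """Number of columns (or supercolumns) stored for a key in one family."""
    return len(table[row_key][family])
```

Here the set of column families is the same for every key, while the columns inside each family
differ from key to key, which mirrors the fixed-families, dynamic-columns split described above.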

= More Detail =

A row-oriented database stores rows in row-major fashion (i.e. all the columns in a row
are kept together). A column-oriented database, on the other hand, stores data on a per-column
basis. Column Families allow a hybrid approach: they let you break your row (the data
corresponding to a key) into a static number of groups, a.k.a. Column Families. In Cassandra,
the data in a table is stored in separate files on a per-Column Family basis, and within
each column family the data is stored in row (i.e. key) major order. Related columns, those
that you'll access together, should ideally be kept within the same column family for access
efficiency. Furthermore, columns in a column family can be sorted and stored on disk either
in time-sorted order or in name-sorted order. However, individual SuperColumns are always
sorted by name. Columns within a super column may be sorted by time.

Suppose we define a table called !MyTable with column families !MySuperColumnFamily (a column
family of type Super) and !MySimpleColumnFamily (a column family of type Simple). Any super
column "SC" in !MySuperColumnFamily is addressed as "!MySuperColumnFamily:SC", and any column
"C" within "SC" is addressed as "!MySuperColumnFamily:SC:C". Any column "C" within
!MySimpleColumnFamily is addressed as "!MySimpleColumnFamily:C". In short, ":" is a reserved
character and should not be used as part of a Column Family name or as part of the name of
a Super Column or Column. (We plan to address this limitation for the 0.4 release.)
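
To make the addressing scheme concrete, here is a minimal Python sketch that builds and splits
such addresses. The helpers `build_address` and `parse_address` are hypothetical and are not
part of any Cassandra API; they only illustrate why ":" cannot appear inside a name.

```python
SEPARATOR = ":"  # reserved: must not appear inside any name


def build_address(family: str, *names: str) -> str:
    """Join a column family name with one or two further names.

    One extra name addresses a column (Simple family) or a supercolumn
    (Super family); two extra names address a column inside a supercolumn.
    """
    parts = (family, *names)
    if not 2 <= len(parts) <= 3:
        raise ValueError("an address has two or three parts")
    for part in parts:
        if not part or SEPARATOR in part:
            raise ValueError(f"invalid name: {part!r}")
    return SEPARATOR.join(parts)


def parse_address(address: str) -> tuple:
    """Split an address back into its parts."""
    parts = address.split(SEPARATOR)
    if len(parts) not in (2, 3) or any(part == "" for part in parts):
        raise ValueError(f"malformed address: {address!r}")
    return tuple(parts)
```

If a column in a Simple family were allowed to be named "a:b", its address would split into
three parts and be indistinguishable from column "b" inside supercolumn "a", which is why
`build_address` rejects such names.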

= Range queries =

Cassandra supports pluggable partitioning schemes; a new scheme can be added with a relatively
small amount of code. Out of the box, Cassandra provides the hash-based RandomPartitioner and
an OrderPreservingPartitioner. RandomPartitioner gives you pretty good load balancing with no
further work required. OrderPreservingPartitioner, on the other hand, lets you perform range
queries on the keys you have stored. Systems that only support hash-based partitioning cannot
perform range queries efficiently.
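
The difference can be sketched in a few lines of Python. This is an illustration only, not
Cassandra's partitioner code: the function names are hypothetical, MD5 is simply assumed here
as the hash, and a sorted list stands in for the ordering of keys across the cluster.

```python
import bisect
import hashlib


def random_token(key: str) -> int:
    """Hash-based placement: the token is unrelated to the key's sort order."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def order_preserving_token(key: str) -> str:
    """Order-preserving placement: the token sorts exactly like the key."""
    return key


def range_query(sorted_keys: list, start: str, end: str) -> list:
    """Return all keys in [start, end] from a list sorted in key order."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]


keys = ["cherry", "apple", "elderberry", "banana", "date"]

# With order-preserving tokens, neighbouring keys are stored next to each
# other, so a range is one contiguous slice.
by_key = sorted(keys, key=order_preserving_token)

# With hashed tokens the storage order is scrambled, so answering the same
# range query would require checking every key.
by_hash = sorted(keys, key=random_token)
```

With `by_key`, `range_query(by_key, "b", "d")` is a single contiguous slice; no such slice
exists in `by_hash`, because keys that are adjacent in key order land at unrelated tokens.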

= Example: SuperColumns for Search Apps =

You can think of each supercolumn name as a term and the columns within it as the docids,
with rank info and other attributes stored as part of each column. If you use userids as
keys, you can have a per-user index stored in this form. This is how the per-user index for
term search is laid out for Inbox search at Facebook. Furthermore, since one has the option
of storing data on disk sorted by "Time", it is very easy for the system to answer queries
of the form "Give me the top 10 messages". For a pictorial explanation, please refer to the
Cassandra PowerPoint slides presented at SIGMOD 2008.
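
The layout above can be sketched in Python as follows. This is an illustrative in-memory model,
not the Inbox search implementation; the identifiers (`index`, `top_messages`, `user42`, the
`rank` and `timestamp` fields) and all data values are made up for the example.

```python
# Hypothetical per-user index:
#   row key (userid) -> supercolumn name (term) -> column name (docid) -> attributes
index = {
    "user42": {
        "cassandra": {
            "msg1": {"rank": 0.9, "timestamp": 100},
            "msg2": {"rank": 0.4, "timestamp": 200},
            "msg3": {"rank": 0.7, "timestamp": 300},
        },
        "wiki": {
            "msg2": {"rank": 0.8, "timestamp": 200},
        },
    },
}


def top_messages(user_id: str, term: str, n: int = 10) -> list:
    """Return up to n docids for a term, most recent first.

    Sorting here stands in for columns that are already stored in time order.
    """
    columns = index[user_id].get(term, {})
    newest_first = sorted(
        columns.items(),
        key=lambda item: item[1]["timestamp"],
        reverse=True,
    )
    return [doc_id for doc_id, _ in newest_first[:n]]
```

When the columns are already kept in time order on disk, the sort step disappears and the query
reduces to reading the first n columns of one supercolumn.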
