cassandra-commits mailing list archives

From Apache Wiki <>
Subject CassandraLimitations reverted to revision 13 on Cassandra Wiki
Date Thu, 29 Jul 2010 21:03:12 GMT
Dear wiki user,

You have subscribed to a wiki page "Cassandra Wiki" for change notification.

The page CassandraLimitations has been reverted to revision 13 by BenjaminBlack.
The comment on this change is: Reverting the change to Chinese.


- = 限制 =
+ = Limitations =
- == 基本上不会变的一些特性 ==
+ == Stuff that isn't likely to change ==
+  * All data for a single row must fit (on disk) on a single machine in the cluster. Because
row keys alone are used to determine the nodes responsible for replicating their data, the
amount of data associated with a single key has this upper bound.
+  * A single column value may not be larger than 2GB.
-  * All data for a single row must fit on one machine in the cluster. Because the cluster
uses only the row key to decide which nodes store the data, the amount of data associated
with a single key has an upper bound.
-  * The total size of a single column may not exceed 2GB.
+ == Artifacts of the current code base ==
+  * Cassandra has two levels of indexes: key and column.  But in super columnfamilies there
is a third level of subcolumns; these are not indexed, and any request for a subcolumn deserializes
_all_ the subcolumns in that supercolumn.  So you want to avoid a data model that requires
large numbers of subcolumns. A ticket is open to remove this limitation.
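One common way around the unindexed-subcolumn problem is to flatten a supercolumn layout into a standard columnfamily by encoding the supercolumn name into the column name, so a read touches only the one column it needs. The sketch below illustrates the idea with plain Python dictionaries standing in for rows and columns (the `flatten` helper and the `:` separator are illustrative assumptions, not Cassandra API):

```python
# Illustrative sketch, not Cassandra API: flatten a supercolumn-style row
# {'supercolumn': {'subcolumn': value}} into a standard-columnfamily row
# {'supercolumn:subcolumn': value}, so each value is addressable by a single
# (indexed) column name instead of hiding behind an unindexed subcolumn level.
def flatten(super_row: dict) -> dict:
    """Encode each supercolumn name into its subcolumns' column names."""
    return {
        f"{sup}:{sub}": value
        for sup, subcols in super_row.items()
        for sub, value in subcols.items()
    }
```

With this layout, fetching one logical subcolumn is a normal single-column read, avoiding the deserialization of every sibling subcolumn.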
+  * <<Anchor(streaming)>>Cassandra's public API is based on Thrift, which offers
no streaming abilities -- any value written or fetched has to fit in memory.  This is inherent
to Thrift's design and is therefore unlikely to change.  So adding large object support to
Cassandra would need a special API that manually split the large objects up into pieces. A
potential approach has been described elsewhere. As a workaround in the meantime, you can
manually split files into chunks of whatever size you are comfortable with -- at least one
person is using 64MB -- and make a file correspond to a row, with the chunks as column values.
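The chunking workaround above can be sketched in a few lines. This is an illustrative assumption of how one might name and order the chunks (the `chunk-NNNNNN` naming scheme and helper functions are hypothetical, not part of any Cassandra client):

```python
# Illustrative sketch: split a large blob into fixed-size chunks so each
# column value stays comfortably under memory and size limits. 64MB mirrors
# the chunk size mentioned above; any comfortable size works.
CHUNK_SIZE = 64 * 1024 * 1024

def split_into_columns(blob: bytes, chunk_size: int = CHUNK_SIZE) -> dict:
    """Map one file's bytes to {column_name: chunk}; one row per file."""
    return {
        f"chunk-{i:06d}": blob[off:off + chunk_size]
        for i, off in enumerate(range(0, len(blob), chunk_size))
    }

def reassemble(columns: dict) -> bytes:
    """Concatenate chunks back together in column-name order."""
    return b"".join(columns[name] for name in sorted(columns))
```

Zero-padded chunk names keep the columns in write order under Cassandra's lexical column sorting, so reassembly is just a sorted concatenation.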
+ == Obsolete Limitations ==
+  * Prior to version 0.7, Cassandra's compaction code deserialized an entire row (per columnfamily)
at a time.  So all the data from a given columnfamily/key pair had to fit in memory, or 2GB,
whichever was smaller (since the length of the row was serialized as a Java int).
+  * Prior to version 0.7, Thrift would crash Cassandra if sent random or malicious data.
This made exposing the Cassandra port directly to the outside internet a Bad Idea.
+  * Prior to version 0.4, Cassandra did not fsync the commitlog before acking a write.  Most
of the time this is Good Enough when you are writing to multiple replicas since the odds are
slim of all replicas dying before the data actually hits the disk, but the truly paranoid
will want real fsync-before-ack.  This is now an option.
- == Some caveats of the current version ==
-  * Cassandra has two levels of indexes: key and column. For super columnfamilies there
is a third level of subcolumns;
-  * <<Anchor(streaming)>>Cassandra's public API is based on Thrift, which does not support
streaming -- so every value read or written must fit in memory. This is inherent to Thrift's
design and is basically not going to change. So to store very large objects in Cassandra,
you would need a special API responsible for splitting the large data into small pieces. A
potential implementation has been described. Alternatively, you can split files into small
chunks yourself, for example 64K, and map each file to a row, with the chunks as column
values. This achieves the goal of storing large objects.
- == Limitations of historical versions ==
-  * Prior to version 0.7, Cassandra's compaction implementation deserialized an entire row
at a time, so the row had to fit in memory -- hence the 2GB limit.
-  * Prior to version 0.7, sending random or malicious data to Thrift could crash it, so the
Thrift port could not be exposed directly to outside users, e.g. Internet users.
-  * Prior to version 0.4, Cassandra did not sync the commitlog before acknowledging a write.
Most of the time this is fine, since it is rare for all replicas to die before the data
actually reaches disk, but you can now enable a real sync of the commitlog before the write
is acknowledged.
