Your data model should take into account the number of items you're storing in a collection.  If you expect it to grow over time with no small upper bound, don't use a collection.  You don't need to read before write to answer this question; it's a decision made at modeling time (before you ever write your very first record).

If the possible values are finite and small, use a collection.  Otherwise normalize.  

If over time you find your collections are getting large, then either an assumption changed or you modeled poorly.  Either way it's time to refactor.

DON'T STORE MORE THAN 100 THINGS IN A COLLECTION

Actually that's probably a bit too hard-edged.  You could easily have a Set<Int> whose typical size is 1000.  If the data doesn't change often, and you always need to know all those values at the same time, there's actually no problem with this.  The problems come when values are constantly mutating as the collection gets large, or when you only need to know a subset of the collection at a time.

-Eric Stevens
ProtectWise, Inc.


On Thu, Jun 6, 2013 at 10:59 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
The problem with "being careful about how much you store in a collection" is that Cassandra is a blind-write system. Knowing how much data is currently in the collection before you write means reading before you write, which is an anti-pattern.

Cassandra Rule 1: DON'T READ BEFORE WRITE
Cassandra Rule 2: ROWS CAN HAVE 2 BILLION COLUMNS
Collection Rule 1: DON'T STORE MORE THAN 100 THINGS IN A COLLECTION

Why are users confused? It's simple.

On Thu, Jun 6, 2013 at 10:51 AM, Eric Stevens <mightye@gmail.com> wrote:
CQL3 does now support dynamic columns. For tags or metadata values you could use a Collection:

This should probably be clarified.  A collection is a super useful tool, but it is not the same thing as a dynamic column.  It has many advantages, but there is one huge disadvantage in that you have to be careful how much data you store in a collection. When you read a single value out of a collection, the entire collection is always read, which of course is true for appending data to the collection as well. 

With a traditional dynamic column, you could have added things like event logs to a record in the form of keys named "event:someEvent:TS" (or swap the order of the parts as your needs dictate).  You could keep doing this practically indefinitely with little degradation in performance.  This was also a common way of representing cross-family relationships (one-to-many style).

If you try to do the same thing with a collection, performance will degrade as your data grows.  For small or relatively static data sets (e.g. tags) that's fine.  For open-ended data sets (logs, events, one-to-many relationships that grow regularly), you should instead normalize such data into a separate column family.
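
As a rough sketch of that normalization (the table and column names here are hypothetical, not from any real schema in this thread), the old "event:someEvent:TS" column-name pattern could map onto a separate CQL3 table whose clustering columns carry the event name and timestamp:

```cql
-- Hypothetical table: one partition per user, one CQL row per event.
-- event_name and occurred_at play the role of the old
-- "event:someEvent:TS" dynamic column name.
CREATE TABLE user_events (
    firstname   text,
    event_name  text,
    occurred_at timestamp,
    detail      text,
    PRIMARY KEY (firstname, event_name, occurred_at)
);

-- Appending an event is a blind write; nothing is read first.
INSERT INTO user_events (firstname, event_name, occurred_at, detail)
VALUES ('joe', 'login', '2013-06-06 10:51:00+0000', 'from web');

-- Reads can target a slice of the events instead of all of them.
SELECT occurred_at, detail
FROM user_events
WHERE firstname = 'joe'
  AND event_name = 'login'
  AND occurred_at > '2013-06-01 00:00:00+0000';
```

Unlike a collection, a read here only touches the rows the query asks for, so the partition can keep growing without every read paying for the whole history.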

-Eric Stevens
ProtectWise, Inc.


On Thu, Jun 6, 2013 at 9:49 AM, Francisco Andrades Grassi <bigjocker@gmail.com> wrote:
Hi,

CQL3 does now support dynamic columns. For tags or metadata values you could use a Collection:
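
A minimal sketch of what that could look like, assuming the `user` table from Joe's message below; the `tags` and `metadata` column names are made up for illustration:

```cql
-- Add collection columns to the existing table (hypothetical names).
ALTER TABLE user ADD tags set<text>;
ALTER TABLE user ADD metadata map<text, text>;

-- Add an element to the set without reading it first.
UPDATE user SET tags = tags + {'admin'} WHERE firstname = 'joe';

-- Store an ad hoc key/value pair, e.g. the middle name.
UPDATE user SET metadata['middlename'] = 'lester' WHERE firstname = 'joe';
```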


For wide rows there's the enhanced primary keys, which I personally prefer over the composite columns of yore:
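
Again only as a sketch, with a hypothetical `user_attributes` table standing in for whatever wide row you need: a compound primary key makes the first column the partition key and the remaining columns clustering columns, so each "dynamic column" becomes its own CQL row inside the partition.

```cql
-- Hypothetical table: firstname is the partition key,
-- attr_name is the clustering column.
CREATE TABLE user_attributes (
    firstname  text,
    attr_name  text,
    attr_value text,
    PRIMARY KEY (firstname, attr_name)
);

-- Each new attribute is just another row; no schema change needed.
INSERT INTO user_attributes (firstname, attr_name, attr_value)
VALUES ('joe', 'middlename', 'lester');

-- Fetch every attribute for one user, or a single one by name.
SELECT attr_name, attr_value FROM user_attributes WHERE firstname = 'joe';
SELECT attr_value FROM user_attributes
WHERE firstname = 'joe' AND attr_name = 'middlename';
```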


--
Francisco Andrades Grassi
@bigjocker

On Jun 6, 2013, at 8:32 AM, Joe Greenawalt <joe.greenawalt@gmail.com> wrote:

Hi,
I'm having some problems figuring out how to append a dynamic column on a column family using the DataStax Java driver 1.0 and CQL3 on Cassandra 1.2.5.  Below is what I'm trying:

cqlsh:simplex> create table user (firstname text primary key, lastname text);
cqlsh:simplex> insert into user (firstname, lastname) values ('joe','shmoe');
cqlsh:simplex> select * from user;

 firstname | lastname
-----------+----------
       joe |    shmoe

cqlsh:simplex> insert into user (firstname, lastname, middlename) values ('joe','shmoe','lester');
Bad Request: Unknown identifier middlename
cqlsh:simplex> insert into user (firstname, lastname, middlename) values ('john','shmoe','lester');
Bad Request: Unknown identifier middlename


I'm assuming you can do this based on previous Thrift-based clients like pycassa, and also by reading this:

The Cassandra data model is a dynamic schema, column-oriented data model. This means that, unlike a relational database, you do not need to model all of the columns required by your application up front, as each row is not required to have the same set of columns. Columns and their metadata can be added by your application as they are needed without incurring downtime to your application.

here: http://www.datastax.com/docs/1.2/ddl/index

Is it a limitation of CQL3 and its connection vs. Thrift?
Or, more likely, am I just doing something wrong?

Thanks,
Joe