Hi all, 

We are working on a data model for storing tweets for multiple streams (where a stream is defined by a number of keyword filters on the full twitter firehose), and retrieving the tweets by timestamp and hashtag. My question is whether the following data model would a good way for doing that, where I'm creating a column name index for the hashtags.

ColumnFamily: streamTweets
     key: streamID + dayTimestamp (creating daily buckets for each stream)
     columns => name: TimeUUID, value: tweet json (storing all the tweets for this stream in a wide row with a TimeUUID)

ColumnFamily: streamHashTagTweets
     key: streamID + dayTimestamp + hashTag (e.g. 123_2013-04-02_cassandra)
     columns => name: TimeUUID (referencing the TimeUUID value in the streamTweets ColumnFamily), value: tweetID

Retrieving the latest 1000 tweets (of a given day) is trivial by requesting the streamTweets columnFamily. Getting the latest tweets for a given hashtag would mean you have to get the TimeUUIDs from the streamHashTagTweets first, and then do a second get call on the streamTweets with the former TimeUUIDs as the list of columns we like to retrieve (column index).

Is referencing column names (TimeUUIDs) a smart thing to do when we have wide rows spanning millions of columns? It seems easier (one reference call) to do this, then it is to reference key values and running a multi-get to get all the rows (we want to get up to 1000 tweets). Referencing key values requires another columnFamily for tweets (key: tweetId, columns: 1 column with data).

Of course we could instead denormalize the data and store the tweet also in the streamHashTagTweet columns, but we want to do the same thing for other indexes as well (topics, twitter usernames, links, etc), so it quickly adds up in required storage space. Next to that we will request tweets by these secondary indexes quite infrequently, while the tweets by timestamp will be requested heavily.

Given we are estimating to store many TBs of tweets, we would prefer setting up machines with spinning disks (2TB per node) to save costs.

We would love to hear your feedback.