incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Coen Stevens <beatle...@gmail.com>
Subject Re: Column index vs Row index vs Denormalizing
Date Thu, 11 Apr 2013 15:09:35 GMT
Thanks for the feedback! We will be going forward by implementing and
deploying the proposed model, and test it out.

Cheers,
Coen


On Thu, Apr 11, 2013 at 12:21 PM, aaron morton <aaron@thelastpickle.com>wrote:

> Retrieving the latest 1000 tweets (of a given day) is trivial by
> requesting the streamTweets columnFamily.
>
> If you normally want to get the most recent items use a reverse comparator
> on the column name
> see http://thelastpickle.com/2011/10/03/Reverse-Comparators/
>
> Getting the latest tweets for a given hashtag would mean you have to get
> the TimeUUIDs from the streamHashTagTweets first, and then do a second get
> call on the streamTweets with the former TimeUUIDs as the list of columns
> we like to retrieve (column index).
>
> You choices here depend on what sort of queries are the most frequent and
> how much disk space you have.
>
> You current model makes sense if the stream by day is the most frequent
> query, and you want to conserve disk space. If disk space is not an issue
> you can denormalise further and store the tweet JSON.
>
> If you have potentially many streamHashTagTweets rows where a single tweet
> is replicated it may make sense to stick with the current design to reduce
> disk use.
>
> (we want to get up to 1000 tweets).
>
> If you want to get 1000 anything from cassandra please break the multiget
> up into multiple calls. Each row request becomes a task in the thread pools
> on RF nodes. If you have a small ish cluster one client asking for 1000
> rows will temporarily block other clients and hurt request throughput.
>
>  Referencing key values requires another columnFamily for tweets (key:
> tweetId, columns: 1 column with data).
>
> This will be a more efficient (aka faster) read than reading from the a
> wide row.
>
> Next to that we will request tweets by these secondary indexes quite
> infrequently, while the tweets by timestamp will be requested heavily.
>
> If the hot path is the streamTweets calls demoralise into that, and
> normalise the tweet storage into it's own CF and reference them from
> the streamHashTagTweets. Have a canonical store of the events / tweets /
> entities  addressable by their business key can give you more flexibility.
>
> Given we are estimating to store many TBs of tweets, we would prefer
> setting up machines with spinning disks (2TB per node) to save costs.
>
> If you have spinning disks and 1G networking the rule of thumb is 300GB to
> 500GB per node. See previous discussions about size per node.
>
> Cheers
>
>    -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 10/04/2013, at 2:00 AM, Coen Stevens <beatlevic@gmail.com> wrote:
>
> Hi all,
>
> We are working on a data model for storing tweets for multiple streams
> (where a stream is defined by a number of keyword filters on the full
> twitter firehose), and retrieving the tweets by timestamp and hashtag. My
> question is whether the following data model would a good way for doing
> that, where I'm creating a column name index for the hashtags.
>
> ColumnFamily: streamTweets
>      key: streamID + dayTimestamp (creating daily buckets for each stream)
>      columns => name: TimeUUID, value: tweet json (storing all the tweets
> for this stream in a wide row with a TimeUUID)
>
> ColumnFamily: streamHashTagTweets
>      key: streamID + dayTimestamp + hashTag (e.g. 123_2013-04-02_cassandra)
>      columns => name: TimeUUID (referencing the TimeUUID value in the
> streamTweets ColumnFamily), value: tweetID
>
> Retrieving the latest 1000 tweets (of a given day) is trivial by
> requesting the streamTweets columnFamily. Getting the latest tweets for a
> given hashtag would mean you have to get the TimeUUIDs from the
> streamHashTagTweets first, and then do a second get call on the
> streamTweets with the former TimeUUIDs as the list of columns we like to
> retrieve (column index).
>
> Is referencing column names (TimeUUIDs) a smart thing to do when we have
> wide rows spanning millions of columns? It seems easier (one reference
> call) to do this, then it is to reference key values and running a
> multi-get to get all the rows (we want to get up to 1000 tweets).
> Referencing key values requires another columnFamily for tweets (key:
> tweetId, columns: 1 column with data).
>
> Of course we could instead denormalize the data and store the tweet also
> in the streamHashTagTweet columns, but we want to do the same thing for
> other indexes as well (topics, twitter usernames, links, etc), so it
> quickly adds up in required storage space. Next to that we will request
> tweets by these secondary indexes quite infrequently, while the tweets by
> timestamp will be requested heavily.
>
> Given we are estimating to store many TBs of tweets, we would prefer
> setting up machines with spinning disks (2TB per node) to save costs.
>
> We would love to hear your feedback.
>
> Cheers,
> Coen
>
>
>

Mime
View raw message