cassandra-user mailing list archives

From Kevin Burton <>
Subject Re: Data model for streaming a large table in real time.
Date Sat, 07 Jun 2014 21:30:23 GMT
On Sat, Jun 7, 2014 at 1:34 PM, Colin <> wrote:

> Maybe it makes sense to describe what you're trying to accomplish in more
> detail.
Essentially, I'm appending writes of recent data from our crawler and
sending that data to our customers.

They need to sync to up to date writes…we need to get them writes within

> A common bucketing approach is along the lines of year, month, day, hour,
> minute, etc and then use a timeuuid as a cluster column.
I mean, that is acceptable, but it means that for that one-minute interval,
all writes are going to that one node (and its replicas).

So that means the total cluster throughput is bottlenecked on the max disk

Same thing for reads… unless our customers are lagged, they are all going
to stampede and ALL of them are going to read data from one node, in a one
minute timeframe.

That's no fun. That will easily DoS our cluster.
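
To make the hotspot concrete, here is a minimal sketch of a time-only
partition key. The function name and the key format are hypothetical, made up
for illustration and not taken from any real schema. Every write within the
same minute maps to the same key, so it lands on the same partition (and the
same replica set).

```python
from datetime import datetime, timezone

def minute_bucket(ts: datetime) -> str:
    """Hypothetical time-only partition key: one key per UTC minute."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")

# Twelve writes spread across one minute (illustrative timestamps).
writes = [
    datetime(2014, 6, 7, 21, 30, second, tzinfo=timezone.utc)
    for second in range(0, 60, 5)
]

# All of them collapse onto a single partition key.
keys = {minute_bucket(ts) for ts in writes}
```

Since `keys` holds only one value, both the writers and every up-to-date
reader converge on that one partition for the whole minute.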

> Depending upon the semantics of the transport protocol you plan on
> utilizing, either the client code keep track of pagination, or the app
> server could, if you utilized some type of request/reply/ack flow.  You
> could keep sequence numbers for each client, and begin streaming data to
> them or allowing query upon reconnect, etc.
> But again, more details of the use case might prove useful.
I think if we were to just use 100 buckets it would probably work just fine.
We're probably not going to be more than 100 nodes in the next year, and if
we are, that's still reasonable performance.
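
Roughly what I have in mind, as a sketch only. The names (partition_key,
keys_for_minute, NUM_BUCKETS) and the choice of CRC32 over an item id are
assumptions for illustration. The idea is a compound partition key of
(minute, bucket), so one minute of writes is spread over 100 partitions
instead of one, and a reader fetches all 100 buckets for the minute it wants.

```python
import zlib
from collections import Counter
from datetime import datetime, timezone

NUM_BUCKETS = 100  # assumed bucket count, per the "100 buckets" idea

def _minute(ts: datetime) -> str:
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")

def partition_key(ts: datetime, item_id: str, num_buckets: int = NUM_BUCKETS):
    """Writer side: (minute, bucket), with the bucket derived from a stable hash."""
    bucket = zlib.crc32(item_id.encode("utf-8")) % num_buckets
    return (_minute(ts), bucket)

def keys_for_minute(ts: datetime, num_buckets: int = NUM_BUCKETS):
    """Reader side: every partition key that must be queried for one minute."""
    minute = _minute(ts)
    return [(minute, bucket) for bucket in range(num_buckets)]

now = datetime(2014, 6, 7, 21, 30, 23, tzinfo=timezone.utc)

# Count how many of 10,000 hypothetical item ids fall into each bucket.
spread = Counter(partition_key(now, f"doc-{i}")[1] for i in range(10_000))
reader_keys = keys_for_minute(now)
```

The tradeoff is that each reader now issues 100 queries per minute of data
instead of one, but those queries hit different partitions rather than all
stampeding the same one.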

I mean, if each of 100 boxes has a 400GB SSD, that's 40TB of VERY fast data.



Location: *San Francisco, CA*
Skype: *burtonator*