cassandra-user mailing list archives

From Kevin Burton <bur...@spinn3r.com>
Subject Re: Data model for streaming a large table in real time.
Date Sat, 07 Jun 2014 23:52:51 GMT
Well, you could add milliseconds, but at best you're still bottlenecking
most of your writes on one box… maybe 2-3 if there are writers that are
lagging.

Anyway… I think using 100 buckets is probably fine.
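Concretely, the 100-bucket idea could look something like this (a Python sketch, not an actual Cassandra client; the key shape and bucket count are assumptions):

```python
import random

NUM_BUCKETS = 100  # spread each minute's writes across up to 100 partitions

def partition_key(minute_ts: int) -> tuple:
    """Compose a (bucket, minute) partition key so writes for the same
    minute fan out across the cluster instead of hitting one node."""
    return (random.randrange(NUM_BUCKETS), minute_ts)

def keys_for_minute(minute_ts: int) -> list:
    """Readers must fan out and query every bucket for a given minute
    to see all rows written in that interval."""
    return [(b, minute_ts) for b in range(NUM_BUCKETS)]
```

The trade-off is the usual one: writes scale out, but a full read of one minute now costs one query per bucket.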

Kevin


On Sat, Jun 7, 2014 at 2:45 PM, Colin <colpclark@gmail.com> wrote:

> Then add seconds to the bucket.  Also, the data will get cached; it's not
> going to hit disk on every read.
>
> Look at the key cache settings on the table.  Also, in 2.1 you have even
> more control over caching.
>
> --
> Colin
> 320-221-9531
>
>
> On Jun 7, 2014, at 4:30 PM, Kevin Burton <burton@spinn3r.com> wrote:
>
>
> On Sat, Jun 7, 2014 at 1:34 PM, Colin <colpclark@gmail.com> wrote:
>
>> Maybe it makes sense to describe what you're trying to accomplish in more
>> detail.
>>
>>
> Essentially, I'm streaming recent writes from our crawler out to our
> customers.
>
> They need to stay in sync with up-to-date writes… we need to get them the
> writes within seconds.
>
> A common bucketing approach is along the lines of year, month, day, hour,
>> minute, etc and then use a timeuuid as a cluster column.
>>
>>
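For concreteness, that time-bucketing scheme might be sketched as follows (Python; `uuid1` is used here as a stand-in for Cassandra's timeuuid, and the key format is an assumption):

```python
import uuid
from datetime import datetime, timezone

def minute_bucket(ts: datetime) -> str:
    """Partition key: truncate the timestamp to the minute,
    e.g. "2014-06-07T21:34" -- all writes in that minute share it."""
    return ts.strftime("%Y-%m-%dT%H:%M")

def clustering_id() -> uuid.UUID:
    """uuid1 is time-based, analogous to a timeuuid clustering column:
    rows sort in write order within the minute partition."""
    return uuid.uuid1()

row_key = minute_bucket(datetime(2014, 6, 7, 21, 34, 12, tzinfo=timezone.utc))
```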
> I mean that is acceptable.. but that means for that 1 minute interval, all
> writes are going to that one node (and its replicas)
>
> So that means the total cluster throughput is bottlenecked on the max disk
> throughput.
>
> Same thing for reads… unless our customers are lagged, they are all going
> to stampede and ALL of them are going to read data from one node, in a one
> minute timeframe.
>
> That's no fun..  that will easily DoS our cluster.
>
>
>> Depending upon the semantics of the transport protocol you plan on
>> utilizing, either the client code keeps track of pagination, or the app
>> server could, if you utilized some type of request/reply/ack flow.  You
>> could keep sequence numbers for each client and begin streaming data to
>> them, or allow querying upon reconnect, etc.
>>
>> But again, more details of the use case might prove useful.
>>
>>
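A per-client sequence-number cursor along the lines Colin describes might be sketched like this (hypothetical names, in-memory only for illustration):

```python
class ClientCursor:
    """Track, per client, the last sequence number acknowledged so a
    reconnecting client can resume streaming without a full replay."""

    def __init__(self):
        self.acked = {}  # client_id -> highest acked sequence number

    def ack(self, client_id: str, seq: int) -> None:
        # Acks may arrive out of order; only ever move the cursor forward.
        self.acked[client_id] = max(seq, self.acked.get(client_id, -1))

    def resume_from(self, client_id: str) -> int:
        # Next sequence number to stream; brand-new clients start at 0.
        return self.acked.get(client_id, -1) + 1
```

In practice this state would itself live in a table keyed by client, but the request/reply/ack flow is the same.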
> I think if we were to just use 100 buckets it would probably work just
> fine.  We're probably not going to have more than 100 nodes in the next
> year, and if we do, that's still reasonable performance.
>
> I mean if each box has a 400GB SSD that's 40TB of VERY fast data.
>
> Kevin
>


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.
