incubator-cassandra-user mailing list archives

From Colin <colpcl...@gmail.com>
Subject Re: Data model for streaming a large table in real time.
Date Sat, 07 Jun 2014 20:34:29 GMT
Maybe it makes sense to describe what you're trying to accomplish in more detail.

A common bucketing approach is to partition along the lines of year, month, day, hour, minute, etc., and
then use a timeuuid as a clustering column.
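
For illustration, a minimal sketch of that kind of table (the table and column names here are hypothetical):

    CREATE TABLE events_by_hour (
        day      text,      -- coarse time bucket, e.g. '2014-06-07'
        hour     int,       -- finer bucket component
        event_id timeuuid,  -- clustering column; orders events within a bucket
        payload  blob,
        PRIMARY KEY ((day, hour), event_id)
    );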

Depending upon the semantics of the transport protocol you plan on utilizing, either the client
code could keep track of pagination, or the app server could, if you utilized some type of request/reply/ack
flow.  You could keep sequence numbers for each client, and begin streaming data to them, or
allow them to query upon reconnect, etc.
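
For the reconnect case, a rough sketch (again, hypothetical names): keep one row per client recording the last event it acknowledged, then resume the stream from that position:

    CREATE TABLE client_positions (
        client_id uuid PRIMARY KEY,
        last_seen timeuuid    -- last event this client acknowledged
    );

    -- on reconnect, look up last_seen for the client and bind it here
    SELECT * FROM events_by_hour
    WHERE day = '2014-06-07' AND hour = 20 AND event_id > ?
    LIMIT 1000;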

But again, more details of the use case might prove useful.

--
Colin
320-221-9531


> On Jun 7, 2014, at 1:53 PM, Kevin Burton <burton@spinn3r.com> wrote:
> 
> Another way around this is to have a separate table storing the number of buckets.
> 
> This way if you have too few buckets, you can just increase them in the future.
> 
> Of course, the older data will still have too few buckets :-(
> 
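A minimal sketch of such a metadata table (names are hypothetical):

    CREATE TABLE bucket_counts (
        table_name  text PRIMARY KEY,
        num_buckets int    -- how many buckets the table currently uses
    );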
> 
>> On Sat, Jun 7, 2014 at 11:09 AM, Kevin Burton <burton@spinn3r.com> wrote:
>> 
>>> On Sat, Jun 7, 2014 at 10:41 AM, Colin Clark <colin@clark.ws> wrote:
>>> It's an anti-pattern and there are better ways to do this.
>> 
>> Entirely possible :)
>> 
>> It would be nice to have a document with a bunch of common Cassandra design patterns.
>> 
>> I've been trying to track down a pattern for this, and a lot of it is pieced together across
>> different places and individual blog posts, so one has to reverse engineer it.
>>  
>>> I have implemented the paging algorithm you've described using wide rows and
>>> bucketing.  This approach is a more efficient utilization of Cassandra's built-in wholesome
>>> goodness.
>> 
>> So.. I assume the general pattern is to:
>> 
>> create buckets.. you create like 2^16 of them, and this is your partition key.
>> 
>> Then you place a timestamp next to the bucket in a primary key.
>> 
>> So essentially:
>> 
>> primary key( bucket, timestamp )… 
>> 
>> .. so to read from this bucket you essentially execute:
>> 
>> select * from foo where bucket = 100 and timestamp > 12345790 limit 10000;
>>  
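A table matching that primary key might look like this rough sketch (the timestamp type and the data column are assumptions):

    CREATE TABLE foo (
        bucket    int,
        timestamp timestamp,   -- or a timeuuid, as Colin suggests above
        data      blob,
        PRIMARY KEY (bucket, timestamp)
    );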
>>> 
>>> Also, I wouldn't let any number of clients (huge) connect directly to the cluster
>>> to do this; put some type of app server in between to handle the comms and fan-out.  You'll
>>> get better utilization of resources and less overhead, in addition to flexibility over which
>>> data center you're utilizing to serve requests.
>> 
>> this is interesting… since the partition key is the bucket, you could make some poor
>> decisions based on the number of buckets.
>> 
>> For example, 
>> 
>> if you use 2^64 buckets, the number of items in each bucket is going to be rather
>> small.  So you're going to have tons of queries each fetching 0-1 rows (if you have a small
>> amount of data).
>> 
>> But if you use very FEW buckets.. say 5, but you have a cluster of 1000 nodes, then
>> you will have those 5 buckets on just 5 nodes (times the replication factor), and the rest of
>> the nodes without any data.
>> 
>> Hm..
>> 
>> the byte-ordered partitioner solves this problem: I can just pick a fixed
>> number of buckets, use that as the primary key prefix, and the data in a bucket can be
>> split up across machines at any arbitrary point, even in the middle of a 'bucket'…
>> 
>> 
>> -- 
>> Founder/CEO Spinn3r.com
>> Location: San Francisco, CA
>> Skype: burtonator
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> 
>> War is peace. Freedom is slavery. Ignorance is strength. Corporations are people.
