Subject: Re: Data model for streaming a large table in real time.
From: Colin <colpclark@gmail.com>
Date: Sat, 7 Jun 2014 16:45:48 -0500
To: user@cassandra.apache.org

Then add seconds to the bucket. Also, the data will get cached; it's not going to hit disk on every read.

Look at the key cache settings on the table. Also, in 2.1 you have even more control over caching.
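For illustration, a bucketed table along those lines might look something like this (just a sketch, untested; the table and column names are made up):

    CREATE TABLE events (
        bucket   timestamp,   -- time window truncated down to the second
        event_id timeuuid,    -- clustering column keeps arrival order within the bucket
        payload  blob,
        PRIMARY KEY (bucket, event_id)
    ) WITH caching = 'keys_only';  -- 2.0-style; 2.1 switches to a map,
                                   -- e.g. caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}

With the key cache holding the partition index entries, repeated reads of a recent bucket mostly stay in memory.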
--
Colin
320-221-9531

> On Jun 7, 2014, at 4:30 PM, Kevin Burton <burton@spinn3r.com> wrote:
>
>> On Sat, Jun 7, 2014 at 1:34 PM, Colin <colpclark@gmail.com> wrote:
>> Maybe it makes sense to describe what you're trying to accomplish in more detail.
>
> Essentially, I'm appending writes of recent data from our crawler and sending that data to our customers.
>
> They need to sync up to the most recent writes… we need to get them writes within seconds.
>
>> A common bucketing approach is along the lines of year, month, day, hour, minute, etc., and then use a timeuuid as a clustering column.
>
> I mean that is acceptable… but that means for that 1-minute interval, all writes are going to that one node (and its replicas).
>
> So that means the total cluster throughput is bottlenecked on the max disk throughput.
>
> Same thing for reads… unless our customers are lagged, they are all going to stampede, and ALL of them are going to read data from one node, in a one-minute timeframe.
>
> That's no fun… that will easily DoS our cluster.
>
>> Depending upon the semantics of the transport protocol you plan on utilizing, either the client code keeps track of pagination, or the app server could, if you utilized some type of request/reply/ack flow. You could keep sequence numbers for each client, and begin streaming data to them or allowing query upon reconnect, etc.
>>
>> But again, more details of the use case might prove useful.
>
> I think if we were to use just 100 buckets it would probably work just fine. We're probably not going to be more than 100 nodes in the next year, and if we are, that's still reasonable performance.
>
> I mean, if each box has a 400GB SSD, that's 40TB of VERY fast data.
>
> Kevin
>
> --
> Founder/CEO Spinn3r.com
> Location: San Francisco, CA
> Skype: burtonator
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
>
> War is peace. Freedom is slavery. Ignorance is strength. Corporations are people.
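Picking up on the 100-bucket idea above: one sketch (untested; names are illustrative) is to put a small bucket number into the partition key alongside the time window, so a single minute's writes spread over ~100 partitions across the ring instead of hammering one node and its replicas:

    CREATE TABLE events_by_minute (
        minute   timestamp,   -- the one-minute time window
        bucket   int,         -- the writer picks e.g. hash(source_id) % 100, or round-robins 0..99
        event_id timeuuid,
        payload  blob,
        PRIMARY KEY ((minute, bucket), event_id)
    );

Readers that want everything for a minute then fan out one query per bucket:

    SELECT * FROM events_by_minute WHERE minute = ? AND bucket = ?;  -- issued for bucket = 0..99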