Subject: Re: Data model for streaming a large table in real time.
From: Colin <colpclark@gmail.com>
Date: Sat, 7 Jun 2014 16:45:48 -0500
To: user@cassandra.apache.org

Then add seconds to the bucket. Also, the data will get cached; it's not going to hit disk on every read.

Look at the key cache settings on the table. Also, in 2.1 you have even more control over caching.
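For illustration, a bucketed table along those lines might look something like this (just a sketch, untested; the table and column names are made up):

    CREATE TABLE events (
        bucket   timestamp,   -- time window truncated down to the second
        event_id timeuuid,    -- clustering column keeps arrival order within the bucket
        payload  blob,
        PRIMARY KEY (bucket, event_id)
    ) WITH caching = 'keys_only';  -- 2.0-style; 2.1 switches to a map,
                                   -- e.g. caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}

With the key cache holding the partition index entries, repeated reads of a recent bucket mostly stay in memory.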
--
Colin
320-221-9531

> On Jun 7, 2014, at 4:30 PM, Kevin Burton <burton@spinn3r.com> wrote:
>
>> On Sat, Jun 7, 2014 at 1:34 PM, Colin <colpclark@gmail.com> wrote:
>> Maybe it makes sense to describe what you're trying to accomplish in more detail.
>
> Essentially, I'm appending writes of recent data from our crawler and sending that data to our customers.
>
> They need to sync up to the most recent writes… we need to get them writes within seconds.
>
>> A common bucketing approach is along the lines of year, month, day, hour, minute, etc., and then use a timeuuid as a clustering column.
>
> I mean that is acceptable… but that means for that 1-minute interval, all writes are going to that one node (and its replicas).
>
> So that means the total cluster throughput is bottlenecked on the max disk throughput.
>
> Same thing for reads… unless our customers are lagged, they are all going to stampede, and ALL of them are going to read data from one node, in a one-minute timeframe.
>
> That's no fun… that will easily DoS our cluster.
>
>> Depending upon the semantics of the transport protocol you plan on utilizing, either the client code keeps track of pagination, or the app server could, if you utilized some type of request/reply/ack flow. You could keep sequence numbers for each client, and begin streaming data to them or allowing query upon reconnect, etc.
>>
>> But again, more details of the use case might prove useful.
>
> I think if we were to use just 100 buckets it would probably work just fine. We're probably not going to be more than 100 nodes in the next year, and if we are, that's still reasonable performance.
>
> I mean, if each box has a 400GB SSD, that's 40TB of VERY fast data.
>
> Kevin
>
> --
> Founder/CEO Spinn3r.com
> Location: San Francisco, CA
> Skype: burtonator
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
>
> War is peace. Freedom is slavery. Ignorance is strength. Corporations are people.
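Picking up on the 100-bucket idea above: one sketch (untested; names are illustrative) is to put a small bucket number into the partition key alongside the time window, so a single minute's writes spread over ~100 partitions across the ring instead of hammering one node and its replicas:

    CREATE TABLE events_by_minute (
        minute   timestamp,   -- the one-minute time window
        bucket   int,         -- the writer picks e.g. hash(source_id) % 100, or round-robins 0..99
        event_id timeuuid,
        payload  blob,
        PRIMARY KEY ((minute, bucket), event_id)
    );

Readers that want everything for a minute then fan out one query per bucket:

    SELECT * FROM events_by_minute WHERE minute = ? AND bucket = ?;  -- issued for bucket = 0..99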