Subject: Re: What's the best modeling approach for ordering events by date?
From: Ethan Rowe
Date: Fri, 15 Apr 2011 14:30:33 -0400
To: user@cassandra.apache.org

Hi.

So, the order-preserving partitioner (OPP) will direct all activity for a range of keys to a particular node (or set of nodes, in accordance with your replication factor). Depending on the volume of writes, this could be fine. Depending on the distribution of key values you write at any given time, it can also be fine. But if you're using the OPP, and your keys align with the time of receiving the data, and your application writes that data as it receives it, you're going to be placing write activity on effectively one node at a time, for the range of time allocated to that node.
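To make the hot spot concrete, here is a toy sketch of that routing behavior. It is not Cassandra's actual token logic; the boundaries and node names are invented for illustration.

from bisect import bisect_left

# Hypothetical token ring: under an order-preserving partitioner, each
# node owns a contiguous range of raw key values, so consecutive
# timestamps all land in whichever range covers "now".
boundaries = ["1302000000", "1302450000", "1302900000"]  # assumed upper bounds
nodes = ["node-a", "node-b", "node-c"]

def node_for_key(key):
    # The first node whose upper boundary is >= the key owns the key.
    return nodes[min(bisect_left(boundaries, key), len(nodes) - 1)]

# Ten consecutive seconds of writes all route to the same node.
for ts in range(1302892233, 1302892243):
    print(ts, node_for_key(str(ts)))  # prints node-c every time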

If you use the RandomPartitioner (RP), and can divide time into finer slices such that you have multiple tweets in a single row, you trade off a more complex read in exchange for better distribution of load throughout your cluster. The necessity of this depends on your particulars.

In your TweetsBySecond example, you're using a deterministic set of keys (the keys correspond to seconds since epoch). Querying for ranges of time is nice with OPP, but if the ranges of time you're interested in are constrained, you don't specifically need OPP. You could use RP and request all the keys for the seconds contained within the time range of interest. In this way, you balance writes across the cluster more effectively than you would with OPP, while still getting a workable data set. Again, the degree to which you need this is dependent on your situation. Others on the list will no doubt have more informed opinions on this than me. :)
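A minimal sketch of that read path, assuming the pycassa client of that era, string row keys, and a TweetsBySecond column family like the one below; the keyspace, hosts, and names are illustrative, not a tested implementation:

import time
import pycassa

pool = pycassa.ConnectionPool('Tweets', ['localhost:9160'])  # assumed keyspace
tweets_by_second = pycassa.ColumnFamily(pool, 'TweetsBySecond')

def tweet_ids_between(start_ts, end_ts):
    # Deterministic keys: one row per second in the window of interest,
    # fetched with multiget instead of an OPP range scan.
    keys = [str(ts) for ts in range(start_ts, end_ts + 1)]
    rows = tweets_by_second.multiget(keys)  # seconds with no tweets are absent
    ids = []
    for key in sorted(rows, key=int):       # reimpose time order client-side
        ids.extend(rows[key].keys())        # column names are the tweet ids
    return ids

now = int(time.time())
print(tweet_ids_between(now - 600, now))    # tweet ids from the last 10 minutes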

On Thu, Apr 14, 2011 at 8:00 PM, Guillermo Winkler <gwinkler@inconcertcc.com> wrote:
> Hi Ethan,
>
> I want to present the events ordered by time, always in pages of 20/40
> events. If the events are tweets, you can have 1000 tweets from the same
> second or you can have 30 tweets in a 10 minute range. But I always want
> to be able to page through the results in an orderly fashion.
>
> I think that using seconds since epoch is what I'm doing, that is,
> dividing time into a fixed series of intervals. Each second is an
> interval, and all of the events for that particular second are columns
> of that row.
>
> Again with tweets, for easier visualization:
>
> TweetsBySecond : {
>     12121121212 : {      -> seconds since epoch
>         id1, id2, id3    -> all the tweet ids that occurred in that second
>     },
>     12121212123 : {
>         id4, id5
>     },
>     12121212124 : {
>         id6
>     }
> }
>
> The problem is you can't do that using OPP in cassandra 0.7, or is it
> just me missing something?
>
> Thanks for your answer,
> Guille

> On Thu, Apr 14, 2011 at 4:49 PM, Ethan Rowe <ethan@the-rowes.com> wrote:
>
>> How do you plan to read the data? Entire histories, or in relatively
>> confined slices of time? Do the events have any attributes by which you
>> might segregate them, apart from time?
>>
>> If you can divide time into a fixed series of intervals, you can insert
>> members of a given interval as columns (or supercolumns) in a row. But
>> it depends on how you want to use the data on the read side.
>>
>> On Thu, Apr 14, 2011 at 12:25 PM, Guillermo Winkler <
>> gwinkler@inconcertcc.com> wrote:
>>
>>> I have a huge number of events I need to consume later, ordered by the
>>> date the event occurred.
>>>
>>> My first approach to this problem was to use seconds since epoch as
>>> row key, and event ids as column names (empty value), this way:
>>>
>>> EventsByDate : {
>>>     SecondsSinceEpoch : {
>>>         evid:"", evid:"", evid:""
>>>     }
>>> }
>>>
>>> And use OPP as partitioner, using GetRangeSlices to retrieve ordered
>>> events sequentially.
>>>
>>> Now I have two problems to solve:
>>>
>>> 1) The system is realtime, so all the events in a given moment are
>>> hitting the same box.
>>> 2) Migrating from cassandra 0.6 to cassandra 0.7, OPP doesn't seem to
>>> like LongType for row keys; was this purposely deprecated?
>>>
>>> I was thinking about secondary indexes, but they do not assure the
>>> order in which the rows come out of cassandra.
>>>
>>> Does anyone have a better approach to model events by date, given
>>> those restrictions?
>>>
>>> Thanks,
>>> Guille
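For completeness, here is the matching write-side sketch for the per-second bucketing discussed in this thread, under the same assumptions as the read sketch above (pycassa, string row keys, event ids as empty-valued column names):

import time
import pycassa

pool = pycassa.ConnectionPool('Tweets', ['localhost:9160'])  # assumed keyspace
tweets_by_second = pycassa.ColumnFamily(pool, 'TweetsBySecond')

def record_event(event_id, occurred_at=None):
    # Bucket the event under its seconds-since-epoch row; the event id is
    # the column name and the value is left empty, as in EventsByDate.
    ts = int(occurred_at if occurred_at is not None else time.time())
    tweets_by_second.insert(str(ts), {event_id: ''})

record_event('id7')  # lands in the row for the current second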

