Subject: Re: What's the best modeling approach for ordering events by date?
From: Ethan Rowe
Date: Fri, 15 Apr 2011 14:30:33 -0400
To: user@cassandra.apache.org

Hi.

So, the order-preserving partitioner (OPP) will direct all activity for a range of keys to a particular node (or set of nodes, in accordance with your replication factor). Depending on the volume of writes, this could be fine. Depending on the distribution of key values you write at any given time, it can also be fine. But if you're using the OPP, and your keys align with the time of receiving the data, and your application writes that data as it receives it, you're going to be placing write activity on effectively one node at a time, for the range of time allocated to that node.
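To make the hot spot concrete, here is a toy sketch of that routing behavior. It is not Cassandra's actual token logic; the boundaries and node names are invented for illustration.

from bisect import bisect_left

# Hypothetical token ring: under an order-preserving partitioner, each
# node owns a contiguous range of raw key values, so consecutive
# timestamps all land in whichever range covers "now".
boundaries = ["1302000000", "1302450000", "1302900000"]  # assumed upper bounds
nodes = ["node-a", "node-b", "node-c"]

def node_for_key(key):
    # The first node whose upper boundary is >= the key owns the key.
    return nodes[min(bisect_left(boundaries, key), len(nodes) - 1)]

# Ten consecutive seconds of writes all route to the same node.
for ts in range(1302892233, 1302892243):
    print(ts, node_for_key(str(ts)))  # prints node-c every time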

If you use the RandomPartitioner (RP), and can divide time into finer slices such that you have multiple tweets in a single row, you trade off a more complex read in exchange for better distribution of load throughout your cluster. The necessity of this depends on your particulars.

In your TweetsBySecond example, you're using a deterministic set of keys (the keys correspond to seconds since epoch). Querying for ranges of time is nice with OPP, but if the ranges of time you're interested in are constrained, you don't specifically need OPP. You could use RP and request all the keys for the seconds contained within the time range of interest. In this way, you balance writes across the cluster more effectively than you would with OPP, while still getting a workable data set. Again, the degree to which you need this is dependent on your situation. Others on the list will no doubt have more informed opinions on this than me. :)
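A minimal sketch of that read path, assuming the pycassa client of that era, string row keys, and a TweetsBySecond column family like the one below; the keyspace, hosts, and names are illustrative, not a tested implementation:

import time
import pycassa

pool = pycassa.ConnectionPool('Tweets', ['localhost:9160'])  # assumed keyspace
tweets_by_second = pycassa.ColumnFamily(pool, 'TweetsBySecond')

def tweet_ids_between(start_ts, end_ts):
    # Deterministic keys: one row per second in the window of interest,
    # fetched with multiget instead of an OPP range scan.
    keys = [str(ts) for ts in range(start_ts, end_ts + 1)]
    rows = tweets_by_second.multiget(keys)  # seconds with no tweets are absent
    ids = []
    for key in sorted(rows, key=int):       # reimpose time order client-side
        ids.extend(rows[key].keys())        # column names are the tweet ids
    return ids

now = int(time.time())
print(tweet_ids_between(now - 600, now))    # tweet ids from the last 10 minutes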

On Thu, Apr 14, 2011 at 8:00 PM, Guillermo Winkler <gwinkler@inconcertcc.com> wrote:
> Hi Ethan,
>
> I want to present the events ordered by time, always in pages of 20/40
> events. If the events are tweets, you can have 1000 tweets from the same
> second or you can have 30 tweets in a 10 minute range. But I always want
> to be able to page through the results in an orderly fashion.
>
> I think that using seconds since epoch is what I'm doing, that is,
> dividing time into a fixed series of intervals. Each second is an
> interval, and all of the events for that particular second are columns
> of that row.
>
> Again with tweets, for easier visualization:
>
> TweetsBySecond : {
>     12121121212 : {      -> seconds since epoch
>         id1, id2, id3    -> all the tweet ids that occurred in that second
>     },
>     12121212123 : {
>         id4, id5
>     },
>     12121212124 : {
>         id6
>     }
> }
>
> The problem is you can't do that using OPP in cassandra 0.7, or is it
> just me missing something?
>
> Thanks for your answer,
> Guille

> On Thu, Apr 14, 2011 at 4:49 PM, Ethan Rowe <ethan@the-rowes.com> wrote:
>
>> How do you plan to read the data? Entire histories, or in relatively
>> confined slices of time? Do the events have any attributes by which you
>> might segregate them, apart from time?
>>
>> If you can divide time into a fixed series of intervals, you can insert
>> members of a given interval as columns (or supercolumns) in a row. But
>> it depends on how you want to use the data on the read side.
>>
>> On Thu, Apr 14, 2011 at 12:25 PM, Guillermo Winkler <
>> gwinkler@inconcertcc.com> wrote:
>>
>>> I have a huge number of events I need to consume later, ordered by the
>>> date the event occurred.
>>>
>>> My first approach to this problem was to use seconds since epoch as
>>> row key, and event ids as column names (empty value), this way:
>>>
>>> EventsByDate : {
>>>     SecondsSinceEpoch : {
>>>         evid:"", evid:"", evid:""
>>>     }
>>> }
>>>
>>> And use OPP as partitioner, using GetRangeSlices to retrieve ordered
>>> events sequentially.
>>>
>>> Now I have two problems to solve:
>>>
>>> 1) The system is realtime, so all the events in a given moment are
>>> hitting the same box.
>>> 2) Migrating from cassandra 0.6 to cassandra 0.7, OPP doesn't seem to
>>> like LongType for row keys; was this purposely deprecated?
>>>
>>> I was thinking about secondary indexes, but they do not assure the
>>> order in which the rows come out of cassandra.
>>>
>>> Does anyone have a better approach to model events by date, given
>>> those restrictions?
>>>
>>> Thanks,
>>> Guille
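For completeness, here is the matching write-side sketch for the per-second bucketing discussed in this thread, under the same assumptions as the read sketch above (pycassa, string row keys, event ids as empty-valued column names):

import time
import pycassa

pool = pycassa.ConnectionPool('Tweets', ['localhost:9160'])  # assumed keyspace
tweets_by_second = pycassa.ColumnFamily(pool, 'TweetsBySecond')

def record_event(event_id, occurred_at=None):
    # Bucket the event under its seconds-since-epoch row; the event id is
    # the column name and the value is left empty, as in EventsByDate.
    ts = int(occurred_at if occurred_at is not None else time.time())
    tweets_by_second.insert(str(ts), {event_id: ''})

record_event('id7')  # lands in the row for the current second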

