incubator-cassandra-user mailing list archives

From Ethan Rowe <et...@the-rowes.com>
Subject Re: What's the best modeling approach for ordering events by date?
Date Fri, 15 Apr 2011 18:30:33 GMT
Hi.

So, the OrderPreservingPartitioner (OPP) will direct all activity for a
range of keys to a particular node (or set of nodes, in accordance with your
replication factor).  Depending on the volume of writes, this could be fine.
Depending on the distribution of key values you write at any given time, it
can also be fine.  But if you're using the OPP, and your keys align with the
time of receiving the data, and your application writes that data as it
receives it, you're going to be placing write activity on effectively one
node at a time, for the range of time allocated to that node.
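
To make the hot spot concrete, here is a toy simulation in Python.  It is
only a sketch: the four-node cluster, the range boundaries, and both
placement functions are assumptions made up for illustration, not
Cassandra's actual partitioner code.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

NODES = 4  # hypothetical cluster size

# Hypothetical OPP ring: node i owns keys up to and including OPP_RANGE_ENDS[i].
OPP_RANGE_ENDS = [1302800000, 1302900000, 1303000000, 1303100000]

def opp_node(key: int) -> int:
    """Order-preserving placement: adjacent keys land in the same range."""
    return bisect_left(OPP_RANGE_ENDS, key) % NODES  # wrap past the last range

def rp_node(key: int) -> int:
    """Hash-based placement: split the 128-bit MD5 space into equal ranges."""
    token = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return (token * NODES) >> 128

# One minute of writes keyed by seconds since epoch.
keys = range(1302892200, 1302892260)
opp_load = Counter(opp_node(k) for k in keys)
rp_load = Counter(rp_node(k) for k in keys)
print("OPP:", dict(opp_load))  # every write lands on a single node
print("RP: ", dict(rp_load))   # writes are spread over the nodes
```

With order-preserving placement, the whole minute of writes falls inside one
node's range; with hashed placement the same keys scatter around the ring.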

If you use the RandomPartitioner (RP), and can divide time into finer slices
such that you have multiple tweets per row, you trade a more complex read
for better distribution of load throughout your cluster.  Whether you need
this depends on your particulars.
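
A rough sketch of the write side, in Python, with a nested dict standing in
for the column family.  The names (row_key, insert_event) and the choice of
a (timestamp, event id) pair as the column name are my own assumptions for
illustration, not anything prescribed by Cassandra:

```python
from collections import defaultdict

def row_key(ts: float, bucket_seconds: int = 1) -> int:
    """Map a timestamp to the start of its fixed-width time slice."""
    return int(ts // bucket_seconds) * bucket_seconds

def insert_event(cf, ts: float, event_id: str, bucket_seconds: int = 1) -> None:
    """Store the event as a column in the row for its time slice.

    The column name is (timestamp, event id), so sorting the column names
    of a row gives the events in time order; the value stays empty.
    """
    cf[row_key(ts, bucket_seconds)][(ts, event_id)] = ""

# Dict-of-dicts stand-in for the column family: row key -> {column name: value}.
events = defaultdict(dict)
insert_event(events, 100.7, "b")
insert_event(events, 100.2, "a")
insert_event(events, 101.1, "c")
print(sorted(events[100]))  # events of second 100, in time order
```

The slice width is the knob: narrower slices mean more row keys to request
on a read, wider slices mean bigger rows.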

In your TweetsBySecond example, you're using a deterministic set of keys
(the keys correspond to seconds since epoch).  Querying for ranges of time
is nice with OPP, but if the ranges of time you're interested in are
constrained, you don't specifically need OPP.  You could use RP and request
all the keys for the seconds contained within the time range of interest.
In this way, you balance writes across the cluster more effectively than
you would with OPP, while still getting a workable data set.  Again, the
degree to which you need this is dependent on your situation.  Others on the
list will no doubt have more informed opinions on this than me.  :)
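
Here is roughly what that read could look like, sketched in Python.  A plain
dict stands in for the TweetsBySecond column family, and the row keys are
made-up small numbers; with a real client the inner lookup would be a
multi-key fetch, so treat the function names as hypothetical.

```python
from itertools import islice
from typing import Dict, Iterator, List, Tuple

def events_in_range(cf: Dict[int, List[str]], start: int,
                    end: int) -> Iterator[Tuple[int, str]]:
    """Yield (second, tweet_id) in time order for start <= second <= end."""
    for second in range(start, end + 1):     # the row keys are known up front
        for tweet_id in cf.get(second, []):  # a missing row is an empty second
            yield second, tweet_id

def pages(cf: Dict[int, List[str]], start: int, end: int,
          page_size: int = 20) -> Iterator[List[Tuple[int, str]]]:
    """Cut the ordered event stream into pages of at most page_size events."""
    events = events_in_range(cf, start, end)
    while True:
        page = list(islice(events, page_size))
        if not page:
            return
        yield page

# Stand-in data: row key (seconds since epoch) -> tweet ids in that second.
tweets_by_second = {100: ["id1", "id2", "id3"], 102: ["id4", "id5"], 103: ["id6"]}
for page in pages(tweets_by_second, 100, 103, page_size=4):
    print(page)
```

Because the keys are deterministic, the reader never needs the cluster to
return rows in key order; it imposes the order itself by walking the seconds.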

On Thu, Apr 14, 2011 at 8:00 PM, Guillermo Winkler
<gwinkler@inconcertcc.com> wrote:

> Hi Ethan,
>
> I want to present the events ordered by time, always in pages of 20/40
> events. If the events are tweets, you can have 1000 tweets from the same
> second or you can have 30 tweets in a 10 minute range. But I always want to
> be able to page through the results in an orderly fashion.
>
> I think using seconds since epoch is what I'm doing, that is, dividing
> time into a fixed series of intervals. Each second is an interval, and all
> of the events for that particular second are columns of that row.
>
> Again with tweets, for easier visualization:
>
> TweetsBySecond : {
>   12121121212 : {     -> seconds since epoch
>     id1, id2, id3     -> all the tweet ids that occurred in that second
>   },
>   12121212123 : {
>     id4, id5
>   },
>   12121212124 : {
>     id6
>   }
> }
>
> The problem is that you can't do that using OPP in Cassandra 0.7, or am I
> just missing something?
>
> Thanks for your answer,
> Guille
>
> On Thu, Apr 14, 2011 at 4:49 PM, Ethan Rowe <ethan@the-rowes.com> wrote:
>
>> How do you plan to read the data?  Entire histories, or in relatively
>> confined slices of time?  Do the events have any attributes by which you
>> might segregate them, apart from time?
>>
>> If you can divide time into a fixed series of intervals, you can insert
>> members of a given interval as columns (or supercolumns) in a row.  But it
>> depends how you want to use the data on the read side.
>>
>>
>> On Thu, Apr 14, 2011 at 12:25 PM, Guillermo Winkler <
>> gwinkler@inconcertcc.com> wrote:
>>
>>> I have a huge number of events I need to consume later, ordered by the
>>> date the event occurred.
>>>
>>> My first approach to this problem was to use seconds since epoch as row
>>> key, and event ids as column names (empty value), this way:
>>>
>>> EventsByDate : {
>>>     SecondsSinceEpoch: {
>>>         evid:"", evid:"", evid:""
>>>     }
>>> }
>>>
>>> And use OPP as the partitioner, using GetRangeSlices to retrieve ordered
>>> events sequentially.
>>>
>>> Now I have two problems to solve:
>>>
>>> 1) The system is realtime, so all the events at a given moment are
>>> hitting the same box
>>> 2) After migrating from Cassandra 0.6 to Cassandra 0.7, OPP doesn't seem
>>> to like LongType for row keys. Was this deprecated on purpose?
>>>
>>> I was thinking about secondary indexes, but they do not guarantee the
>>> order in which the rows come out of Cassandra.
>>>
>>> Does anyone have a better approach to modeling events by date given
>>> those restrictions?
>>>
>>> Thanks,
>>> Guille
>>>
>>>
>>>
>>
>
