cassandra-user mailing list archives

From yoshiyuki kanno <nekota...@gmail.com>
Subject Re: Giant sets of ordered data
Date Thu, 03 Jun 2010 15:21:53 GMT
Hi

I think that in this case (logging heavy traffic), neither of the two ideas
can scale write operations in current Cassandra.
So you may have to wait for secondary index support.

2010/6/3 Jonathan Shook <jshook@gmail.com>

> Insert "if you want to use long values for keys and column names"
> above paragraph 2. I forgot that part.
>
> On Wed, Jun 2, 2010 at 1:29 PM, Jonathan Shook <jshook@gmail.com> wrote:
> > If you want to do range queries on the keys, you can use OPP to do this:
> > (example using UTF-8 lexicographic keys, with bursts split across rows
> > according to row size limits)
> >
> > Events: {
> >  "20100601.05.30.003": {
> >    "20100601.05.30.003": <value>
> >    "20100601.05.30.007": <value>
> >    ...
> >  }
> > }
> >
> > With a future version of Cassandra, you may be able to use the same
> > basic datatype for both key and column name, as keys will be binary
> > like the rest, I believe.
> >
> > I'm not aware of specific performance improvements when using OPP
> > range queries on keys vs iterating over known keys. I suspect (hope)
> > that round-tripping to the server should be reduced, which may be
> > significant. Does anybody have decent benchmarks that tell the
> > difference?
> >
> >
> > On Wed, Jun 2, 2010 at 11:53 AM, Ben Browning <ben324@gmail.com> wrote:
> >> With a traffic pattern like that, you may be better off storing the
> >> events of each burst (I'll call them group) in one or more keys and
> >> then storing these keys in the day key.
> >>
> >> EventGroupsPerDay: {
> >>  "20100601": {
> >>    // column name is the timestamp the group was received,
> >>    // column value is the group's key
> >>    123456789: "group123",
> >>    123456790: "group124"
> >>  }
> >> }
> >>
> >> EventGroups: {
> >>  "group123": {
> >>    123456789: "value1",
> >>    123456799: "value2"
> >>   }
> >> }
> >>
> >> If you think of Cassandra as a toolkit for building scalable indexes,
> >> the modeling seems to get a bit easier. In this case, you're
> >> building an index by day to look up events that come in as groups. So,
> >> first you'd fetch the slice of columns for the day you're interested
> >> in to figure out which groups to look at, then you'd fetch the events
> >> in those groups.
> >>
> >> There are plenty of alternate ways to divide up the data among rows
> >> also - you could use hour keys instead of days as an example.
> >>
> >> On Wed, Jun 2, 2010 at 11:57 AM, David Boxenhorn <david@lookin2.com> wrote:
> >>> Let's say you're logging events, and you have billions of events. What if
> >>> the events come in bursts, so within a day there are millions of events, but
> >>> they all come within microseconds of each other a few times a day? How do
> >>> you find the events that happened on a particular day if you can't store
> >>> them all in one row?
> >>>
> >>> On Wed, Jun 2, 2010 at 6:45 PM, Jonathan Shook <jshook@gmail.com> wrote:
> >>>>
> >>>> Either OPP by key, or within a row by column name. I'd suggest the latter.
> >>>> If you have structured data to stick under a column (named by the
> >>>> timestamp), then you can serialize and unserialize it yourself, or you
> >>>> can use a supercolumn. It's effectively the same thing.  Cassandra
> >>>> only provides the super column support as a convenience layer as it is
> >>>> currently implemented. That may change in the future.
> >>>>
> >>>> You didn't make clear in your question why a standard column would be
> >>>> less suitable. I presumed you had layered structure within the
> >>>> timestamp, hence my response.
> >>>> How would you logically partition your dataset according to natural
> >>>> application boundaries? This will answer most of your question.
> >>>> If you have a dataset which can't be partitioned into a reasonable
> >>>> size row, then you may want to use OPP and key concatenation.
> >>>>
> >>>> What do you mean by giant?
> >>>>
> >>>> On Wed, Jun 2, 2010 at 10:32 AM, David Boxenhorn <david@lookin2.com> wrote:
> >>>> > How do I handle giant sets of ordered data, e.g. by timestamps, which I
> >>>> > want to access by range?
> >>>> >
> >>>> > I can't put all the data into a supercolumn, because it's loaded into
> >>>> > memory at once, and it's too much data.
> >>>> >
> >>>> > Am I forced to use an order-preserving partitioner? I don't want the
> >>>> > headache. Is there any other way?
> >>>> >
> >>>
> >>>
> >>
> >
>
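
For reference, the two-row layout Ben Browning describes in the quoted
thread (a per-day index row pointing at group rows) can be sketched roughly
as below. This is only an in-memory illustration using plain Python dicts,
not a Cassandra client: the function names store_group and events_for_day,
and the start_ts/end_ts parameters, are made up for this sketch. It says
nothing about write scalability, which is the concern raised above.

```python
# Hypothetical in-memory stand-ins for the two column families from the
# quoted example. Plain dicts, not a Cassandra client.
event_groups_per_day = {}  # day key -> {received timestamp: group key}
event_groups = {}          # group key -> {event timestamp: value}


def store_group(day, received_ts, group_key, events):
    """Write one burst: the group row first, then its index entry for the day."""
    event_groups.setdefault(group_key, {}).update(events)
    event_groups_per_day.setdefault(day, {})[received_ts] = group_key


def events_for_day(day, start_ts=None, end_ts=None):
    """Read the day row to find group keys, then read each group row.

    start_ts and end_ts (inclusive, optional) filter on the timestamp at
    which the group was received, i.e. on the index row's column names.
    """
    index_row = event_groups_per_day.get(day, {})
    result = []
    for received_ts in sorted(index_row):  # column names are kept ordered
        if start_ts is not None and received_ts < start_ts:
            continue
        if end_ts is not None and received_ts > end_ts:
            continue
        group_row = event_groups.get(index_row[received_ts], {})
        result.extend(sorted(group_row.items()))
    return result


if __name__ == "__main__":
    store_group("20100601", 123456789, "group123",
                {123456789: "value1", 123456799: "value2"})
    store_group("20100601", 123456790, "group124", {123456791: "value3"})
    for ts, value in events_for_day("20100601"):
        print(ts, value)
```

Each read of a day costs one slice of the index row plus one fetch per
group, so choosing hour keys instead of day keys, as suggested in the
thread, would only change the first argument passed to store_group.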
