From: Jonathan Shook <jshook@gmail.com>
To: user@cassandra.apache.org
Date: Wed, 2 Jun 2010 13:31:24 -0500
Subject: Re: Giant sets of ordered data

Insert "if you want to use long values for keys and column names" above
paragraph 2. I forgot that part.

On Wed, Jun 2, 2010 at 1:29 PM, Jonathan Shook wrote:
> If you want to do range queries on the keys, you can use OPP to do this
> (example using UTF-8 lexicographic keys, with bursts split across rows
> according to row size limits):
>
> Events: {
>   "20100601.05.30.003": {
>     "20100601.05.30.003":
>     "20100601.05.30.007":
>     ...
>   }
> }
>
> With a future version of Cassandra, you may be able to use the same
> basic datatype for both key and column name, as keys will be binary
> like the rest, I believe.
>
> I'm not aware of specific performance improvements when using OPP
> range queries on keys vs. iterating over known keys. I suspect (hope)
> that round-trips to the server would be reduced, which may be
> significant. Does anybody have decent benchmarks that tell the
> difference?
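A minimal sketch of the lexicographic bucket-key scheme above, in plain
Java with no Cassandra client calls. The exact "yyyyMMdd.HH.mm.SSS"
layout and the UTC zone are assumptions based on the
"20100601.05.30.003" example; adjust the granularity to your row size
limits:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    public class BucketKeys {
        // Fixed-width, zero-padded fields make string order match time
        // order, which is what OPP range scans over row keys rely on.
        private static final SimpleDateFormat FORMAT =
                new SimpleDateFormat("yyyyMMdd.HH.mm.SSS");
        static {
            // Pin the zone so every writer produces comparable keys.
            FORMAT.setTimeZone(TimeZone.getTimeZone("UTC"));
        }

        public static String keyFor(long epochMillis) {
            // SimpleDateFormat is not thread-safe; synchronize here, or
            // use one instance per thread in real code.
            synchronized (FORMAT) {
                return FORMAT.format(new Date(epochMillis));
            }
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            System.out.println(keyFor(now));
            System.out.println(keyFor(now + 1)); // sorts after the line above
        }
    }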
> On Wed, Jun 2, 2010 at 11:53 AM, Ben Browning wrote:
>> With a traffic pattern like that, you may be better off storing the
>> events of each burst (I'll call them groups) in one or more keys and
>> then storing those keys in the day key.
>>
>> EventGroupsPerDay: {
>>   "20100601": {
>>     123456789: "group123", // column name is the timestamp the group
>>                            // was received; column value is its key
>>     123456790: "group124"
>>   }
>> }
>>
>> EventGroups: {
>>   "group123": {
>>     123456789: "value1",
>>     123456799: "value2"
>>   }
>> }
>>
>> If you think of Cassandra as a toolkit for building scalable indexes,
>> the modeling gets a bit easier. In this case, you're building an index
>> by day to look up events that come in as groups. First fetch the slice
>> of columns for the day you're interested in to figure out which groups
>> to look at, then fetch the events in those groups. (A plain-Java
>> sketch of this two-step lookup appears at the end of this message.)
>>
>> There are plenty of other ways to divide the data among rows, too;
>> you could use hour keys instead of day keys, for example.
>>
>> On Wed, Jun 2, 2010 at 11:57 AM, David Boxenhorn wrote:
>>> Let's say you're logging events, and you have billions of events.
>>> What if the events come in bursts, so that within a day there are
>>> millions of events, but they all arrive within microseconds of each
>>> other a few times a day? How do you find the events that happened on
>>> a particular day if you can't store them all in one row?
>>>
>>> On Wed, Jun 2, 2010 at 6:45 PM, Jonathan Shook wrote:
>>>> Either OPP by key, or within a row by column name. I'd suggest the
>>>> latter. If you have structured data to stick under a column (named
>>>> by the timestamp), then you can serialize and deserialize it
>>>> yourself, or you can use a supercolumn. It's effectively the same
>>>> thing: as currently implemented, Cassandra only provides supercolumn
>>>> support as a convenience layer. That may change in the future.
>>>>
>>>> You didn't make clear in your question why a standard column would
>>>> be less suitable. I presumed you had layered structure within the
>>>> timestamp, hence my response.
>>>>
>>>> How would you logically partition your dataset according to natural
>>>> application boundaries? That will answer most of your question. If
>>>> you have a dataset which can't be partitioned into reasonably sized
>>>> rows, then you may want to use OPP and key concatenation.
>>>>
>>>> What do you mean by giant?
>>>>
>>>> On Wed, Jun 2, 2010 at 10:32 AM, David Boxenhorn wrote:
>>>> > How do I handle giant sets of ordered data, e.g. by timestamps,
>>>> > which I want to access by range?
>>>> >
>>>> > I can't put all the data into a supercolumn, because it's loaded
>>>> > into memory at once, and it's too much data.
>>>> >
>>>> > Am I forced to use an order-preserving partitioner? I don't want
>>>> > the headache. Is there any other way?
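A minimal sketch of the two-step lookup described in Ben's message, with
plain Java maps standing in for the two column families; no Cassandra
client is involved, and all names (EventGroupsPerDay, group123, ...)
follow the example above:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class GroupIndexSketch {
        // Day row key ("20100601") -> (timestamp column -> group key).
        static final Map<String, SortedMap<Long, String>> eventGroupsPerDay =
                new HashMap<>();
        // Group row key ("group123") -> (timestamp column -> event value).
        static final Map<String, SortedMap<Long, String>> eventGroups =
                new HashMap<>();

        // Step 1: slice the day row to find the group keys.
        // Step 2: fetch each group row and collect its events in
        // timestamp order.
        static List<String> eventsForDay(String day) {
            List<String> events = new ArrayList<>();
            SortedMap<Long, String> dayRow = eventGroupsPerDay.get(day);
            if (dayRow == null) return events;
            for (String groupKey : dayRow.values()) {
                SortedMap<Long, String> group = eventGroups.get(groupKey);
                if (group != null) events.addAll(group.values());
            }
            return events;
        }

        public static void main(String[] args) {
            SortedMap<Long, String> day = new TreeMap<>();
            day.put(123456789L, "group123");
            eventGroupsPerDay.put("20100601", day);

            SortedMap<Long, String> group = new TreeMap<>();
            group.put(123456789L, "value1");
            group.put(123456799L, "value2");
            eventGroups.put("group123", group);

            System.out.println(eventsForDay("20100601")); // [value1, value2]
        }
    }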