From: Christian Decker
Date: Sat, 28 Aug 2010 17:45:09 +0200
Subject: Join & Range Query performance
To: user@cassandra.apache.org

I'm wondering what the performance considerations are for join-like queries.

I have a ColumnFamily that holds millions of records (not unusual, as I understand) and I want to work on them using Pig and Hadoop. Until now we have always fetched all rows from Cassandra and then filtered and worked on them. The idea now is to introduce indices to speed up some of these analyses. Let's assume we have page hits, each of which has a user associated with it, and many of our queries work on the users. Creating a ColumnFamily whose key is the user id would be the logical choice, but that would mean we'd store all the data twice (once in the all-encompassing ColumnFamily and once as SubcolumnFamilies in the index), and since we might add further indices, it would multiply our data size.

Usually in a relational world we'd not save the data in the index, but a pointer to the real entry. Would it be wise to just store the key of the referenced item and then iteratively fetch the items from the cluster?

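To make the pointer-style index concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the two dicts are hypothetical in-memory stand-ins for the data ColumnFamily and the index ColumnFamily, and the names insert_hit, fetch_hits_for_user and batch_size are made up for this example. No real Cassandra client calls are used; against a cluster, each batch would correspond to one multi-key read.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for two ColumnFamilies; a real client
# would issue reads and writes against the cluster instead.
page_hits = {}                      # hit key -> full record (stored once)
hits_by_user = defaultdict(list)    # user id -> list of hit keys (pointers only)


def insert_hit(hit_key, user_id, record):
    """Store the record once and add only its key to the user index."""
    page_hits[hit_key] = record
    hits_by_user[user_id].append(hit_key)


def fetch_hits_for_user(user_id, batch_size=2):
    """Resolve the index pointers in batches, one lookup round per batch."""
    keys = hits_by_user.get(user_id, [])
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        # Stand-in for a single batched multi-key read.
        for key in batch:
            yield page_hits[key]


insert_hit("hit-1", "alice", {"url": "/home"})
insert_hit("hit-2", "bob", {"url": "/about"})
insert_hit("hit-3", "alice", {"url": "/docs"})
insert_hit("hit-4", "alice", {"url": "/faq"})

alice_urls = [hit["url"] for hit in fetch_hits_for_user("alice")]
```

The trade-off in this layout is the one raised above: the index holds only keys, so the data size does not multiply with each new index, but every indexed read costs an extra round of lookups.
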
Also, I'd like to know how key range queries perform compared to simple key lookups, since I'd like to build a dynamic storage system that splits really large rows into smaller ones by specifying one more byte of the key (so from a\0\0\0\0 we might go to a\0\0\0\0 - a\255\0\0\0, and then get all results by simply querying a\0\0\0\0 through a\255\255\255\255). I have no idea if this is even possible, just playing around with some ideas :D
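
The key-splitting idea can be sketched as follows. This is a simplified illustration, not a definitive design: OrderedStore is a hypothetical in-memory stand-in that keeps row keys in byte order (in Cassandra a key range only comes back in key order with an order-preserving partitioner), bucket_key is a made-up helper, and the split uses a fixed one-byte bucket rather than splitting dynamically as rows grow.

```python
import bisect

PREFIX = b"a"


def bucket_key(prefix, bucket):
    """Row key for one bucket: the prefix, one distinguishing byte, zero padding."""
    return prefix + bytes([bucket]) + b"\x00\x00\x00"


class OrderedStore:
    """Hypothetical stand-in for a store that keeps row keys in byte order."""

    def __init__(self):
        self._keys = []   # sorted row keys
        self._rows = {}   # row key -> list of column values

    def insert(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = []
        self._rows[key].append(value)

    def get(self, key):
        """Simple key lookup."""
        return self._rows.get(key, [])

    def get_range(self, start, end):
        """Key range query: all rows with start <= key <= end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


store = OrderedStore()
# Spread the values of one logical row "a" over buckets chosen by value % 256.
for value in range(1000):
    store.insert(bucket_key(PREFIX, value % 256), value)
# A row under a different prefix must not show up in the range below.
store.insert(bucket_key(b"b", 0), -1)

# Query a\0\0\0\0 through a\255\255\255\255 to collect every bucket of "a".
rows = store.get_range(PREFIX + b"\x00" * 4, PREFIX + b"\xff" * 4)
all_values = sorted(v for _, values in rows for v in values)
```

Here a single range query returns all 256 sub-rows of the logical row, while a simple key lookup such as store.get(bucket_key(PREFIX, 5)) touches only one bucket.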

Regards,
Chris