Message-ID: <4A6EC441.8030208@mollenhour.com>
Date: Tue, 28 Jul 2009 05:26:25 -0400
From: Colin Mollenhour
To: cassandra-user@incubator.apache.org
Subject: Greetings!

Hi all,

I am new to the Cassandra scene. I have watched presentations, read papers and articles, run the server with some basic usage, digested the Thrift interface and absorbed as much info as possible without frying my brain with all of this stuff. I am working on a web app that will have some social networking aspects as well as some other features that involve lots of "event" records, and while I have a good enough understanding to do some damage, I don't feel comfortable writing an app just yet.

I actually started the app with a PHP framework and a MySQL schema that isn't too complex and have started distilling it into a Cassandra schema as best I can, but this is where I am getting stuck. I'm not sure if I'm trying to fit a square peg into a round hole or if I am just not lining it up right, so perhaps you can help me? I've been going off of the "twitter" examples (Evan Weaver, Eric Florenzano) as my point of reference but have a few questions about specifics.

Background:
I have, for the most part, "users", "journals", and "events". Events have one of several types (variable), are either a start-end range or a single point in time, and have various metadata. Journals have multiple events plus various metadata. Over its lifetime I estimate a journal will accrue 20k-60k events. Users have multiple journals and can share access to journals with other users. Users will own <10 journals but some users might share access to more than that at once.
I'd like it to scale to as many users as we can get to sign up, potentially very very many, hence my interest in Cassandra :)

I need to be able to fetch all or latest events with the following "queries":
-A specific journal
-All of a user's journals
-A specific event type
-A specific event type for a specific journal
-A specific event type for all of a user's journals

After much deliberation in trying to figure out how to do the above without having to loop through many, many queries, here is the schema I arrived at: http://bit.ly/6Hj9I

If I am correct in my thinking, all of the above cases can be retrieved in one or two steps, with the maximum number of queries being determined by the number of journals in question. Am I wrong to try to reduce the number of indexes and round-trips to the database by modeling this way?

Some more general questions:

My model assumes the use of get_slice_by_names with a potentially large number of keys, is that ok?

Cassandra lacks transactions and increment methods; is there a way to generate unique user ids with just Cassandra as the authority that I am missing?

Is it silly to use short column names for the sake of performance or storage efficiency? E.g. uid instead of user_id. I like verbose names...

Thanks!
Colin
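
To make the five "queries" above concrete, here is a rough Python sketch of one index layout that could serve them with at most one lookup per journal. It is not the schema behind the link, and it does not use the Thrift API: plain dicts stand in for column families, sorted lists stand in for time-ordered columns, and every name in it (journal_events, add_event, latest_for_journals, the "meal" and "sleep" event types) is made up for illustration.

```python
import bisect
import heapq
from collections import defaultdict
from itertools import islice

# Hypothetical in-memory stand-ins for index rows. Each row keeps
# (timestamp, event_id) pairs sorted by time, oldest first.
journal_events = defaultdict(list)       # journal_id -> [(ts, event_id)]
journal_type_events = defaultdict(list)  # (journal_id, type) -> [(ts, event_id)]
type_events = defaultdict(list)          # event_type -> [(ts, event_id)]
user_journals = defaultdict(set)         # user_id -> {journal_id}


def add_event(journal_id, event_type, ts, event_id):
    """Write one event into every index row that must be able to find it."""
    entry = (ts, event_id)
    bisect.insort(journal_events[journal_id], entry)
    bisect.insort(journal_type_events[(journal_id, event_type)], entry)
    bisect.insort(type_events[event_type], entry)


def latest(row, n):
    """One 'slice' of a single row: the n newest entries, newest first."""
    return row[-n:][::-1] if n > 0 else []


def latest_for_journals(journal_ids, n, event_type=None):
    """One slice per journal, then a merge: len(journal_ids) lookups in total."""
    slices = []
    for jid in journal_ids:
        if event_type is None:
            row = journal_events.get(jid, [])
        else:
            row = journal_type_events.get((jid, event_type), [])
        slices.append(latest(row, n))
    # Every slice is newest-first, so merge in descending order and keep n.
    return list(islice(heapq.merge(*slices, reverse=True), n))


# Sample data: "colin" can see journals j1 and j2, but not j3.
user_journals["colin"] = {"j1", "j2"}
add_event("j1", "meal", 100, "e1")
add_event("j1", "sleep", 110, "e2")
add_event("j2", "meal", 120, "e3")
add_event("j3", "meal", 130, "e4")

# The five queries, in the order listed above.
one_journal = latest(journal_events["j1"], 10)
all_user_journals = latest_for_journals(user_journals["colin"], 10)
one_type = latest(type_events["meal"], 10)
one_type_one_journal = latest(journal_type_events[("j1", "meal")], 10)
one_type_user_journals = latest_for_journals(user_journals["colin"], 10, "meal")

for result in (one_journal, all_user_journals, one_type,
               one_type_one_journal, one_type_user_journals):
    print(result)
```

The trade-off in this sketch is that every event is written three times so that each single-journal or single-type query is one slice, and the per-user queries cost one slice per journal plus a merge on the client.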