From: Sylvain Lebresne
Date: Fri, 1 Oct 2010 16:42:40 +0200
Subject: Re: [DISCUSSION] High-volume counters in Cassandra
To: dev@cassandra.apache.org

On Thu, Sep 30,
2010 at 6:29 PM, Ryan King wrote:
> On Tue, Sep 28, 2010 at 10:14 PM, Jonathan Ellis wrote:
>> On Tue, Sep 28, 2010 at 4:00 PM, Sylvain Lebresne wrote:
>>> I agree that it is worth adding a support for counter as supercolumns
>>> in 1546 and that's fairly trivial, so I will add that as soon as possible
>>> (but please understand that I'm working on this for a good part during
>>> my free time).
>>>
>>> As for supercolumns of counters, there is what Jonathan proposes, but
>>> I'll add that encoding a supercolumns CF to a standard column CF is
>>> almost always a fairly trivial encoding. Worst case scenario it requires
>>> you to roll up your own comparator and it's slightly less convenient
>>> for client code.
>>
>> Supporting supercolumns to allow multiple counters per row, but
>> requiring encoding with a custom comparator for deeper nesting, seems
>> like a reasonable compromise to me.
>
> I don't understand how this would work.

This is a general technique, not specific to counters. Let me be a bit
more precise. The idea is to encode the following super row:

  key: {
    scol1 : { col1 : v1, col2 : v2, col3 : v3 },
    scol2 : { col4 : v4, col5 : v5 }
  }

as the standard row:

  key : {
    scol1|col1 : v1, scol1|col2 : v2, scol1|col3 : v3,
    scol2|col4 : v4, scol2|col5 : v5
  }

To get slightly more technical, scol1|col1 could be:

  [length of scol1][scol1 bytes][0][col1 bytes]

By that I mean that the encoded column name starts with 4 bytes holding
the length of scol1 (a byte[]), then the scol1 bytes, then a 0 byte, then
the col1 bytes.

The 0 byte after the super column name is there for slice queries: it
makes it possible to express the end of a super column in the encoding.
More precisely, for slice queries, the start of scol1 is

  [length of scol1][scol1 bytes][0]

and the end of scol1 is

  [length of scol1][scol1 bytes][1]

The custom comparator is fairly easy to write. It takes the super column
comparator (comp1) and the column comparator (comp2).
To compare two (encoded) keys, it first reads the super column name of
each key (using the length at the start of each key) and compares them
with comp1. If they are not equal, it returns that comparison value.
Otherwise, it reads the next byte of each key. If these bytes differ, the
bigger key is the one with the 1. If they are equal, it reads the two
column names and compares them using comp2.

Translating slice predicates is fairly trivial. You'll just have to
iterate over the result to regroup the columns into super columns, but
that's no biggie.

The only thing that is less efficient is querying super columns by name
(querying sub columns by name is fine, however). For that, you'll have to
issue one slice query for each requested name.

Some remove operations could probably also be slightly less efficient,
but in the end removes are broken with counters (both in 1072 and 1546;
I'll refer you to the comments on the latter ticket), so it's not a big
deal.

To sum up, I can see the following drawbacks to such an encoding:
  - querying SC by name is less efficient.
  - it takes more disk space (but that's the cheapest resource we have,
    isn't it?).
It has, however, at least one advantage:
  - your super columns are indexed, so you don't have to deserialize them
    entirely each time.

I'd say these are fair compromises.

--
Sylvain
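
For illustration, the encoding, the slice bounds, the comparator and the regrouping step described above could be sketched roughly as follows. This is only a sketch in Python, not Cassandra code: the function names (encode, split, slice_bounds, make_comparator, regroup, bytes_cmp) are hypothetical, and the big-endian layout of the 4-byte length is an assumption, since the description only says "4 bytes".

```python
import struct

def encode(scol: bytes, col: bytes) -> bytes:
    # [length of scol][scol bytes][0][col bytes]
    # Assumption: the 4-byte length is an unsigned big-endian integer.
    return struct.pack(">I", len(scol)) + scol + b"\x00" + col

def split(name: bytes):
    # Inverse of encode: returns (scol, marker byte, col).
    (n,) = struct.unpack_from(">I", name, 0)
    return name[4:4 + n], name[4 + n], name[5 + n:]

def slice_bounds(scol: bytes):
    # Start and end of a super column, for slice queries.
    prefix = struct.pack(">I", len(scol)) + scol
    return prefix + b"\x00", prefix + b"\x01"

def bytes_cmp(a: bytes, b: bytes) -> int:
    # Plain byte-wise comparison, standing in for comp1 / comp2.
    return (a > b) - (a < b)

def make_comparator(comp1, comp2):
    # comp1 compares super column names, comp2 compares column names.
    def compare(a: bytes, b: bytes) -> int:
        scol_a, mark_a, col_a = split(a)
        scol_b, mark_b, col_b = split(b)
        result = comp1(scol_a, scol_b)
        if result != 0:
            return result
        if mark_a != mark_b:
            # The bigger key is the one with the 1.
            return 1 if mark_a == 1 else -1
        return comp2(col_a, col_b)
    return compare

def regroup(columns):
    # Rebuild {scol: {col: value}} from (encoded name, value) pairs.
    row = {}
    for name, value in columns:
        scol, _, col = split(name)
        row.setdefault(scol, {})[col] = value
    return row

# Usage: every column of scol1 falls between its two slice bounds.
compare = make_comparator(bytes_cmp, bytes_cmp)
start, end = slice_bounds(b"scol1")
name = encode(b"scol1", b"col1")
in_slice = compare(start, name) <= 0 and compare(name, end) < 0
```

The comparator is needed because sorting the raw encoded bytes would order names by the length prefix first, which does not match the order given by comp1 in general.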