Subject: Re: Cassandra's 2GB row limit and indexing
From: Jonathan Shook <jshook@gmail.com>
To: user@cassandra.apache.org
Date: Wed, 26 May 2010 21:36:35 -0500

The example is a little confusing... but:

1) "Sharding"
You can square the capacity by using a two-level map:
    CF1->row->value->CF2->row->value
This means finding some natural subgrouping or hash that provides a good
distribution.

2) "Hashing"
You can also use some additional key hashing to spread the rows over a
wider space. Find a delimiter that works for you and identify the owning
row as, for example:
    "domain" + "delimiter" + (hash(domain) modulo some divisor)

3) "Overflow"
You can implement overflow logic to create overflow rows, which act like
(2) but are less sparse:
    while count(columns) for candidate row > some threshold:
        try row + "delimiter" + subrow++
This is much easier when you are streaming data in, as opposed to poking
in a random value here and there.

Just some ideas. I'd go with (2), and find a way to adjust the modulo to
minimize the row spread. (2) isn't guaranteed to provide uniformity, and
(3) isn't guaranteed to provide very good performance. Perhaps a
combination of them both? The column count is readily accessible, so it
may provide for some informed choices at run time. I'm assuming your
column sizes are fairly predictable.

Has anybody else tackled this before?
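A minimal sketch of the bucketing idea in (2), in Python. One adaptation: the bucket here is derived from the short URL rather than from the domain alone, since hashing only the domain would map every entry for a hot domain to the same single row; the `NUM_BUCKETS` divisor, `DELIMITER`, and both helper functions are hypothetical names for illustration, not part of any Cassandra API:

```python
import hashlib

# Hypothetical parameters -- tune the divisor so that no bucket row
# for a hot domain approaches the 2GB row limit.
NUM_BUCKETS = 16
DELIMITER = "|"

def bucket_row_key(domain: str, short_url: str) -> str:
    """Return the index row key that owns this entry, e.g. 'thing.com|7'.

    A stable digest (md5 here) is used instead of Python's built-in
    hash() so the mapping is consistent across processes and restarts.
    """
    digest = hashlib.md5(short_url.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return domain + DELIMITER + str(bucket)

def all_row_keys(domain: str) -> list:
    """Reading a whole domain becomes a small fan-out over every bucket."""
    return [domain + DELIMITER + str(b) for b in range(NUM_BUCKETS)]
```

The write path stays a single-row insert; only reads that want every link for a domain pay the fan-out, and that cost is bounded by the divisor you choose.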
On Wed, May 26, 2010 at 8:52 PM, Richard West wrote:
> Hi all,
>
> I'm currently looking at new database options for a URL shortener in order
> to scale well with increased traffic as we add new features. Cassandra seems
> to be a good fit for many of our requirements, but I'm struggling a bit to
> find ways of designing certain indexes in Cassandra due to its 2GB row
> limit.
>
> The easiest example of this is that I'd like to create an index by the
> domain that shortened URLs are linking to, mostly for spam control so it's
> easy to grab all the links to any given domain. As far as I can tell the
> typical way to do this in Cassandra is something like:
>
> DOMAIN = { //columnfamily
>     thing.com { //row key
>         timestamp: "shorturl567", //column name: value
>         timestamp: "shorturl144",
>         timestamp: "shorturl112",
>         ...
>     }
>     somethingelse.com {
>         timestamp: "shorturl817",
>         ...
>     }
> }
>
> The values here are keys for another columnfamily containing various data on
> shortened URLs.
>
> The problem with this approach is that a popular domain (e.g. blogspot.com)
> could be used in many millions of shortened URLs, so would have that many
> columns and hit the row size limit mentioned at
> http://wiki.apache.org/cassandra/CassandraLimitations.
>
> Does anyone know an effective way to design this type of one-to-many index
> around this limitation (could be something obvious I'm missing)? If not, are
> the changes proposed for https://issues.apache.org/jira/browse/CASSANDRA-16
> likely to make this type of design workable?
>
> Thanks in advance for any advice,
>
> Richard
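The overflow idea in (3), applied to this DOMAIN index, could look roughly like the sketch below. An in-memory dict stands in for the column family so the logic is runnable without a cluster; in real code the dict lookup would be a column-count query against Cassandra. `THRESHOLD`, `DELIMITER`, and `overflow_insert` are all hypothetical names, and the threshold is tiny here purely for illustration:

```python
# Hypothetical per-row column cap; a real deployment would set this to
# whatever keeps a row comfortably under the 2GB limit.
THRESHOLD = 3
DELIMITER = "|"

# Stand-in for the DOMAIN column family: row key -> {column name: value}.
domain_cf = {}

def overflow_insert(domain: str, timestamp: str, short_url: str) -> str:
    """Write an index entry, spilling into 'domain|0', 'domain|1', ... rows.

    Walks sub-rows in order until one is under the threshold -- the
    "while count(columns) > threshold, try row + delimiter + subrow++"
    idea from the reply above. Returns the row key actually written to.
    """
    subrow = 0
    while True:
        row_key = domain + DELIMITER + str(subrow)
        columns = domain_cf.setdefault(row_key, {})
        if len(columns) < THRESHOLD:
            columns[timestamp] = short_url
            return row_key
        subrow += 1
```

As the reply notes, this suits streaming ingestion: writes naturally fill sub-row 0, then 1, and so on, at the cost of a count check per spilled row on the write path.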