From: Ryan King
Date: Thu, 16 Jun 2011 16:47:10 -0700
Subject: Re: compression for regular column names?
To: user@cassandra.apache.org

On Thu, Jun 16, 2011 at 3:41 PM, E R wrote:
> Hi all,
>
> As a way of gaining familiarity with Cassandra I am migrating a table
> that is currently stored in a relational database and mapping it into
> a Cassandra column family. We add about 700,000 new rows a day to this
> table, and the average disk space used per row is ~ 300 bytes
> including indexes.
>
> The mapping from table to column family is straightforward - there is
> a one-to-one relationship between table columns and column family
> column names. The relational table has 19 columns. The length of the
> names of the columns is nearly 200 bytes whereas the average amount of
> data per row is only 130 bytes.
>
> Initially I used the identity map for this translation - i.e. my
> Cassandra column names were the same as the relational column names. I
> then found out I could save a lot of disk space by using single letter
> column names instead of the original relational names. I.e. use 'L'
> instead of 'LINK_IDENTIFIER' for a column name.
>
> The procedure I use to determine space used is:
>
> 1. rm -rf the cassandra var-lib directory
> 2. start cassandra, create keyspace, column families, etc.
> 3. insert records
> 4. stop cassandra
> 5. re-start cassandra
> 6. measure disk space with du -s on the cassandra var-lib directory
>
> This seems to replace the commit logs with .db files.
>
> My questions are:
>
> 1. Is this a common practice (i.e. making the client responsible for
> shortening the column names) when dealing with a large number of fixed
> column names and a high volume of inserts? Is there any way that
> Cassandra can help out here?

Yes, we're working on a new, compressed format CASSANDRA-674.

> 2. Is there another way to transform the commit logs into .db files
> without stopping and starting the server?

nodetool flush.

-ryan
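
For anyone doing the client-side shortening from question 1 in the meantime, here is a rough sketch of one way to keep it in a single place, so application code still uses the long names. It is only an illustration, not anything Cassandra provides: the only mapping taken from this thread is 'LINK_IDENTIFIER' -> 'L'; the other column names and the helper functions (shorten, expand, name_bytes) are made up for the example.

```python
# Client-side column-name shortening: a sketch, not a Cassandra API.
# Only LINK_IDENTIFIER -> "L" comes from the thread; the other entries
# are hypothetical placeholders for the remaining relational columns.
LONG_TO_SHORT = {
    "LINK_IDENTIFIER": "L",
    "CREATED_AT": "C",   # hypothetical column
    "SOURCE_URL": "S",   # hypothetical column
}

# Reverse map used when reading rows back.
SHORT_TO_LONG = {short: long for long, short in LONG_TO_SHORT.items()}

# Two long names sharing a short name would silently lose data.
if len(SHORT_TO_LONG) != len(LONG_TO_SHORT):
    raise ValueError("short column names must be unique")


def shorten(row):
    """Translate {long_name: value} into {short_name: value} before a write."""
    return {LONG_TO_SHORT[name]: value for name, value in row.items()}


def expand(columns):
    """Translate {short_name: value} back into {long_name: value} after a read."""
    return {SHORT_TO_LONG[name]: value for name, value in columns.items()}


def name_bytes(row):
    """Total bytes spent on column names for one row (UTF-8 encoded)."""
    return sum(len(name.encode("utf-8")) for name in row)
```

Calling shorten() just before the insert and expand() just after the read keeps the short names out of the rest of the application, and comparing name_bytes(row) with name_bytes(shorten(row)) gives a quick estimate of the per-row name overhead before measuring with du.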