From: Robert Važan <robert.vazan@gmail.com>
Date: Fri, 04 Oct 2013 09:54:21 +0200
To: user@cassandra.apache.org
Subject: Re: Minimum row size / minimum data point size

That spreadsheet doesn't take compression into account, which is very important in my case. Uncompressed, my data is going to require a petabyte of storage according to the spreadsheet. I am pretty sure I won't get that much storage to play with.

The spreadsheet also shows that Cassandra wastes an unbelievable amount of space on compaction. My experiments with LevelDB, however, show that it is possible for a write-optimized database to use negligible compaction space. I am not sure how LevelDB does it. I guess it splits the larger sstables into smaller chunks and merges them incrementally.
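In case it matters for answers: as far as I understand, Cassandra's LeveledCompactionStrategy borrows exactly this idea from LevelDB and only needs roughly ten times sstable_size_in_mb of temporary space per compaction, rather than a full copy of the sstables being merged. This is just a sketch of what I plan to try; the table name is a placeholder:

    -- switch an existing table to LevelDB-style leveled compaction
    ALTER TABLE measurements
    WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'sstable_size_in_mb': 160
    };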

Anyway, does anybody know how densely I can store the data in Cassandra when compression is enabled? Would I have to implement some smart adaptive grouping to fit lots of records in one row, or is there a simpler solution?
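To make the question concrete: if I read the sizing formulas behind that spreadsheet right, one data point per row costs roughly 23 bytes of row overhead plus about 15 bytes of column overhead before compression, which dwarfs my 1-byte payload. The simplest grouping I can think of is a fixed time bucket per partition, roughly like this (all names and the bucket size are made up; the option names are the CQL3 / Cassandra 2.0 ones):

    CREATE TABLE samples (
        series_id bigint,    -- which time series
        bucket    int,       -- e.g. day number, keeps partitions bounded
        ts        timestamp,
        value     blob,      -- ~1 byte after my custom encoding
        PRIMARY KEY ((series_id, bucket), ts)
    ) WITH compression = {
        'sstable_compression': 'DeflateCompressor',
        'chunk_length_kb': 64
    };

Whether Deflate over larger chunks gets anywhere near 1 byte per point on top of the per-cell overhead is exactly what I can't tell without hearing from someone who has measured it.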

On 4. 10. 2013 1:56, Andrey Ilinykh wrote:
> It may help.
> https://docs.google.com/spreadsheet/ccc?key=0Atatq_AL3AJwdElwYVhTRk9KZF9WVmtDTDVhY0xPSmc#gid=0
>
> On Thu, Oct 3, 2013 at 1:31 PM, Robert Važan <robert.vazan@gmail.com> wrote:
>
> I need to store one trillion data points. The data is highly compressible down to 1 byte per data point using simple custom compression combined with standard dictionary compression. What's the most space-efficient way to store the data in Cassandra? How much per-row overhead is there if I store one data point per row?
>
> The data is particularly hard to group. It's a large number of time series with highly variable density. That makes it hard to pack subsets of the data into meaningful column families / wide rows. Is there a table layout scheme that would allow me to approach 1 byte per data point without forcing me to implement a complex abstraction layer at the application level?


