From: Robert Važan <robert.vazan@gmail.com>
Date: Fri, 04 Oct 2013 09:54:21 +0200
To: user@cassandra.apache.org
Subject: Re: Minimum row size / minimum data point size

That spreadsheet doesn't take compression into account, which is very important in my case. Uncompressed, my data is going to require a petabyte of storage according to the spreadsheet. I am pretty sure I won't get that much storage to play with.

The spreadsheet also shows that Cassandra wastes an unbelievable amount of space on compaction. My experiments with LevelDB, however, show that it is possible for a write-optimized database to use negligible compaction space. I am not sure how LevelDB does it. I guess it splits the larger sstables into smaller chunks and merges them incrementally.
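In case it matters for answers: as far as I understand, Cassandra's LeveledCompactionStrategy borrows exactly this idea from LevelDB and only needs roughly ten times sstable_size_in_mb of temporary space per compaction, rather than a full copy of the sstables being merged. This is just a sketch of what I plan to try; the table name is a placeholder:

    -- switch an existing table to LevelDB-style leveled compaction
    ALTER TABLE measurements
    WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'sstable_size_in_mb': 160
    };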

Anyway, does anybody know how densely I can store the data in Cassandra when compression is enabled? Would I have to implement some smart adaptive grouping to fit lots of records in one row, or is there a simpler solution?
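To make the question concrete: if I read the sizing formulas behind that spreadsheet right, one data point per row costs roughly 23 bytes of row overhead plus about 15 bytes of column overhead before compression, which dwarfs my 1-byte payload. The simplest grouping I can think of is a fixed time bucket per partition, roughly like this (all names and the bucket size are made up; the option names are the CQL3 / Cassandra 2.0 ones):

    CREATE TABLE samples (
        series_id bigint,    -- which time series
        bucket    int,       -- e.g. day number, keeps partitions bounded
        ts        timestamp,
        value     blob,      -- ~1 byte after my custom encoding
        PRIMARY KEY ((series_id, bucket), ts)
    ) WITH compression = {
        'sstable_compression': 'DeflateCompressor',
        'chunk_length_kb': 64
    };

Whether Deflate over larger chunks gets anywhere near 1 byte per point on top of the per-cell overhead is exactly what I can't tell without hearing from someone who has measured it.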

On 4. 10. 2013 1:56, Andrey Ilinykh wrote:
> It may help.
> https://docs.google.com/spreadsheet/ccc?key=0Atatq_AL3AJwdElwYVhTRk9KZF9WVmtDTDVhY0xPSmc#gid=0
>
> On Thu, Oct 3, 2013 at 1:31 PM, Robert Važan <robert.vazan@gmail.com> wrote:
>
> I need to store one trillion data points. The data is highly compressible down to 1 byte per data point using simple custom compression combined with standard dictionary compression. What's the most space-efficient way to store the data in Cassandra? How much per-row overhead is there if I store one data point per row?
>
> The data is particularly hard to group. It's a large number of time series with highly variable density. That makes it hard to pack subsets of the data into meaningful column families / wide rows. Is there a table layout scheme that would allow me to approach 1 byte per data point without forcing me to implement a complex abstraction layer at the application level?


