From: Michael Widmann <michael.widmann@gmail.com>
To: user@cassandra.apache.org
Reply-To: user@cassandra.apache.org
Date: Mon, 26 Jul 2010 06:01:51 +0200
Subject: Re: Cassandra to store 1 billion small 64KB Blobs
In-Reply-To: <211e72fb-1bb1-cbe0-d7fc-8e84117907c3@me.com>

Thanks for this detailed description ...

You mentioned the secondary index in a standard column family; would it be
better to build several indices? Is it even possible to build an index on,
for example, 32 columns?

The hint about the smaller boxes is very valuable!

Mike

2010/7/26 Aaron Morton <aaron@thelastpickle.com>

For what it's worth...

* Many smaller boxes with local disk storage are preferable to 2 with huge
NAS storage.
* To cache the hash values, look at the KeysCached setting in the
storage-config.
* There are some row size limits; see
http://wiki.apache.org/cassandra/CassandraLimitations
* If you wanted to get 1000 blobs, rather than grouping them in a single
row using a super column, consider building a secondary index in a standard
column family: one CF for the blobs keyed by your hash, and one CF keyed by
whatever the grouping key is, with a column for every blob's hash value.
Read from the index first, then from the blobs themselves.
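[A minimal sketch of this index-then-blobs pattern, assuming a pycassa-style
client; the keyspace and CF names ('BlobStore', 'Blobs', 'BlobIndex') are
invented for illustration and are not from the thread:]

    import pycassa

    # Names are assumptions: 'Blobs' rows are keyed by blob hash,
    # 'BlobIndex' rows by the grouping key, one column per blob hash.
    pool = pycassa.ConnectionPool('BlobStore', ['localhost:9160'])
    blobs = pycassa.ColumnFamily(pool, 'Blobs')
    index = pycassa.ColumnFamily(pool, 'BlobIndex')

    def store_blob(group_key, blob_hash, data):
        # Write the blob row, then record its hash in the group's index row.
        blobs.insert(blob_hash, {'data': data})
        index.insert(group_key, {blob_hash: ''})

    def fetch_group(group_key):
        # Read the index row first, then batch-read the blobs by hash.
        hashes = list(index.get(group_key, column_count=10000).keys())
        return [cols['data'] for cols in blobs.multiget(hashes).values()]

[multiget batches the blob reads, which fits the "get about 1000 blobs"
restore pattern mentioned later in the thread.]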

Aaron


On 24 Jul 2010, at 06:51 PM, Michael Widmann <michael.widmann@gmail.com> wrote:

Hi Jonathan

Thanks for your very valuable input on this.

Maybe I didn't explain it well enough, so I'll try to clarify.

Here are some thoughts:

- binary data will not be indexed - only stored
- The file name of the binary data (a hash) should be indexed for search
- We could group the hashes into 62 "entry" points for search retrieval
  -> I think supercolumns, if I have the terms right (a-z, A-Z, 0-9);
  see the sketch after this list
- the 64KB blobs' metadata (which blob belongs to which file) should be
  stored separately in cassandra
- For hardware we rely on solaris / opensolaris with ZFS in the backend
- Write operations occur much more often than reads
- Memory should hold the hash values mainly for fast search (not the
  binary data)
- Read operations (restore from cassandra) may be async - get about
  1000 blobs, group them, restore
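[To illustrate the 62 entry points idea from the list above - a hedged
sketch that assumes the hashes are base62-encoded, since plain hex hashes
would only ever reach 16 of the 62 buckets:]

    import string

    # One bucket row per leading character: a-z, A-Z, 0-9 (62 total).
    ENTRY_POINTS = set(string.ascii_lowercase +
                       string.ascii_uppercase + string.digits)

    def entry_point(blob_hash):
        # The first character of the hash selects the bucket row.
        c = blob_hash[0]
        if c not in ENTRY_POINTS:
            raise ValueError('hash must start with one of [a-zA-Z0-9]')
        return c

    # With the index CF from Aaron's reply (names assumed):
    #   index.insert(entry_point(h), {h: file_id})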
So my questions are:

2 or 3 big boxes, or 10 to 20 small boxes for storage?
Could we separate "caching" - hash-value CFs cached and indexed, binary
data CFs not?
Writes happen around the clock - not at tremendous speed, but constantly.
Would compaction of the database need much disk space?
Is it reliable at this size? (more my fear)

thx for thinking and answers...

greetings

Mike

2010/7/23 Jonathan Shook <jshook@gmail.com>
There are two scaling factors to consider here. In general the worst-case
growth of operations in Cassandra is kept near to O(log2(N)). Any worse
growth would be considered a design problem, or at least a high-priority
target for improvement. This is important for considering the load
generated by very large column families, as binary search is used when the
bloom filter doesn't exclude rows from a query. O(log2(N)) is basically the
best achievable growth for this type of data, but the bloom filter improves
on it in some cases by paying a lower cost every time.
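[To make that concrete, a toy sketch - not Cassandra's actual code - of the
read path just described: the bloom filter rejects definite misses at a
small fixed cost, and only possible hits pay the O(log2(N)) binary search
over the sorted row keys.]

    import bisect
    import hashlib

    class BloomFilter(object):
        # Tiny illustrative bloom filter; sizes are arbitrary.
        def __init__(self, num_bits=1 << 20, num_hashes=3):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, key):
            for i in range(self.num_hashes):
                digest = hashlib.md5(('%d:%s' % (i, key)).encode()).digest()
                yield int.from_bytes(digest[:8], 'big') % self.num_bits

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(key))

    class SSTableSketch(object):
        def __init__(self, rows):          # rows: dict of row key -> value
            self.rows = rows
            self.keys = sorted(rows)       # the sorted row index
            self.bloom = BloomFilter()
            for k in self.keys:
                self.bloom.add(k)

        def get(self, key):
            if not self.bloom.might_contain(key):
                return None                # cheap negative: no search, no seek
            i = bisect.bisect_left(self.keys, key)   # the O(log2(N)) step
            if i < len(self.keys) and self.keys[i] == key:
                return self.rows[key]
            return None                    # bloom false positive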

The other factor to be aware of is the reduction of binary search
performance for datasets which can push disk seek times into high ranges.
This is mostly a direct consideration for installations which will be doing
lots of cold reads (not cached data) against large sets. Disk seek times
are much lower for adjacent or nearby tracks, and generally much higher
when tracks are sufficiently far apart (as in a very large data set). This
can compound with other factors when session times are longer, but that is
to be expected with any system. Your storage system may have completely
different characteristics depending on caching, etc.

The read performance is still quite high relative to other systems for a
similar data set size, but the drop-off in performance may be much worse
than expected if you are expecting it to be linear. Again, this is not
unique to Cassandra. It's just an important consideration when dealing with
extremely large sets of data, when memory is not likely to be able to hold
enough hot data for the specific application.

As always, the real questions have lots more to do with your specific
access patterns, storage system, etc. I would look at the benchmarking info
available on the lists as a good starting point.


On Fri, Jul 23, 2010 at 11:51 AM, Michael Widmann
<michael.widmann@gmail.com> wrote:
> Hi
>
> We plan to use cassandra as a data storage on at least 2 nodes with RF=2
> for about 1 billion small files.
> We have about 48TB of disk space behind each node.
>
> Now my question is - is this possible with cassandra, reliably - meaning
> every blob is stored on 2 jbods?
>
> We may grow to nearly 40TB or more of cassandra "storage" data ...
>
> Has anyone done something similar?
>
> For retrieval of the blobs we are going to index them with a hash value
> (meaning hashes are used to store the blob) ...
> so we can search fast for the entry in the database and combine the blobs
> into a normal file again ...
>
> thanks for any answers
>
> michael
>
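[A sketch of the blob scheme described above: split a file into 64KB
chunks, address each chunk by its content hash, and rebuild the file from
the ordered hash list. The store is stubbed as a dict here; in practice it
would be the blob CF.]

    import hashlib

    BLOB_SIZE = 64 * 1024

    def split_file(path, store):
        # store: dict-like mapping blob hash -> bytes.
        # Returns the ordered hash manifest needed to rebuild the file.
        manifest = []
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(BLOB_SIZE)
                if not chunk:
                    break
                h = hashlib.sha1(chunk).hexdigest()
                store[h] = chunk   # content-addressed: equal chunks dedupe
                manifest.append(h)
        return manifest

    def rebuild_file(manifest, store, out_path):
        # Concatenate the chunks back in manifest order.
        with open(out_path, 'wb') as out:
            for h in manifest:
                out.write(store[h])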



--
bayoda.com - Professional Online Backup Solutions for Small and Medium
Sized Companies


--
bayoda.com - Professional Online Backup Solutions for Small and Medium
Sized Companies