From: Michael Widmann <michael.widmann@gmail.com>
To: user@cassandra.apache.org
Reply-To: user@cassandra.apache.org
Date: Mon, 26 Jul 2010 06:01:51 +0200
Subject: Re: Cassandra to store 1 billion small 64KB Blobs
In-Reply-To: <211e72fb-1bb1-cbe0-d7fc-8e84117907c3@me.com>

Thanks for this detailed description ...

You mentioned the secondary index in a standard column family; would it be
better to build several indices? Is it even possible to build an index on,
for example, 32 columns?

The hint about the smaller boxes is very valuable!

Mike

2010/7/26 Aaron Morton <aaron@thelastpickle.com>

For what it's worth...

* Many smaller boxes with local disk storage are preferable to 2 with huge
NAS storage.
* To cache the hash values, look at the KeysCached setting in the
storage-config.
* There are some row size limits; see
http://wiki.apache.org/cassandra/CassandraLimitations
* If you wanted to get 1000 blobs, rather than grouping them in a single
row using a super column, consider building a secondary index in a standard
column family: one CF for the blobs keyed by your hash, and one CF keyed by
whatever the grouping key is, with a column for every blob's hash value.
Read from the index first, then from the blobs themselves.
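[A minimal sketch of this index-then-blobs pattern, assuming a pycassa-style
client; the keyspace and CF names ('BlobStore', 'Blobs', 'BlobIndex') are
invented for illustration and are not from the thread:]

    import pycassa

    # Names are assumptions: 'Blobs' rows are keyed by blob hash,
    # 'BlobIndex' rows by the grouping key, one column per blob hash.
    pool = pycassa.ConnectionPool('BlobStore', ['localhost:9160'])
    blobs = pycassa.ColumnFamily(pool, 'Blobs')
    index = pycassa.ColumnFamily(pool, 'BlobIndex')

    def store_blob(group_key, blob_hash, data):
        # Write the blob row, then record its hash in the group's index row.
        blobs.insert(blob_hash, {'data': data})
        index.insert(group_key, {blob_hash: ''})

    def fetch_group(group_key):
        # Read the index row first, then batch-read the blobs by hash.
        hashes = list(index.get(group_key, column_count=10000).keys())
        return [cols['data'] for cols in blobs.multiget(hashes).values()]

[multiget batches the blob reads, which fits the "get about 1000 blobs"
restore pattern mentioned later in the thread.]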

Aaron


On 24 Jul 2010, at 06:51 PM, Michael Widmann <michael.widmann@gmail.com> wrote:

Hi Jonathan

Thanks for your very valuable input on this.

Maybe I didn't explain it well enough, so I'll try to clarify.

Here are some thoughts:

- binary data will not be indexed - only stored
- The file name of the binary data (a hash) should be indexed for search
- We could group the hashes into 62 "entry" points for search retrieval
  -> I think supercolumns, if I have the terms right (a-z, A-Z, 0-9);
  see the sketch after this list
- the 64KB blobs' metadata (which blob belongs to which file) should be
  stored separately in cassandra
- For hardware we rely on solaris / opensolaris with ZFS in the backend
- Write operations occur much more often than reads
- Memory should hold the hash values mainly for fast search (not the
  binary data)
- Read operations (restore from cassandra) may be async - get about
  1000 blobs, group them, restore
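[To illustrate the 62 entry points idea from the list above - a hedged
sketch that assumes the hashes are base62-encoded, since plain hex hashes
would only ever reach 16 of the 62 buckets:]

    import string

    # One bucket row per leading character: a-z, A-Z, 0-9 (62 total).
    ENTRY_POINTS = set(string.ascii_lowercase +
                       string.ascii_uppercase + string.digits)

    def entry_point(blob_hash):
        # The first character of the hash selects the bucket row.
        c = blob_hash[0]
        if c not in ENTRY_POINTS:
            raise ValueError('hash must start with one of [a-zA-Z0-9]')
        return c

    # With the index CF from Aaron's reply (names assumed):
    #   index.insert(entry_point(h), {h: file_id})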
So my questions are:

2 or 3 big boxes, or 10 to 20 small boxes for storage?
Could we separate "caching" - hash-value CFs cached and indexed, binary
data CFs not?
Writes happen around the clock - not at tremendous speed, but constantly.
Would compaction of the database need much disk space?
Is it reliable at this size? (more my fear)

thx for thinking and answers...

greetings

Mike

2010/7/23 Jonathan Shook <jshook@gmail.com>
There are two scaling factors to consider here. In general the worst-case
growth of operations in Cassandra is kept near to O(log2(N)). Any worse
growth would be considered a design problem, or at least a high-priority
target for improvement. This is important for considering the load
generated by very large column families, as binary search is used when the
bloom filter doesn't exclude rows from a query. O(log2(N)) is basically the
best achievable growth for this type of data, but the bloom filter improves
on it in some cases by paying a lower cost every time.
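[To make that concrete, a toy sketch - not Cassandra's actual code - of the
read path just described: the bloom filter rejects definite misses at a
small fixed cost, and only possible hits pay the O(log2(N)) binary search
over the sorted row keys.]

    import bisect
    import hashlib

    class BloomFilter(object):
        # Tiny illustrative bloom filter; sizes are arbitrary.
        def __init__(self, num_bits=1 << 20, num_hashes=3):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, key):
            for i in range(self.num_hashes):
                digest = hashlib.md5(('%d:%s' % (i, key)).encode()).digest()
                yield int.from_bytes(digest[:8], 'big') % self.num_bits

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(key))

    class SSTableSketch(object):
        def __init__(self, rows):          # rows: dict of row key -> value
            self.rows = rows
            self.keys = sorted(rows)       # the sorted row index
            self.bloom = BloomFilter()
            for k in self.keys:
                self.bloom.add(k)

        def get(self, key):
            if not self.bloom.might_contain(key):
                return None                # cheap negative: no search, no seek
            i = bisect.bisect_left(self.keys, key)   # the O(log2(N)) step
            if i < len(self.keys) and self.keys[i] == key:
                return self.rows[key]
            return None                    # bloom false positive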

The other factor to be aware of is the reduction of binary search
performance for datasets which can push disk seek times into high ranges.
This is mostly a direct consideration for installations which will be doing
lots of cold reads (not cached data) against large sets. Disk seek times
are much lower for adjacent or nearby tracks, and generally much higher
when tracks are sufficiently far apart (as in a very large data set). This
can compound with other factors when session times are longer, but that is
to be expected with any system. Your storage system may have completely
different characteristics depending on caching, etc.

The read performance is still quite high relative to other systems for a
similar data set size, but the drop-off in performance may be much worse
than expected if you are expecting it to be linear. Again, this is not
unique to Cassandra. It's just an important consideration when dealing with
extremely large sets of data, when memory is not likely to be able to hold
enough hot data for the specific application.

As always, the real questions have lots more to do with your specific
access patterns, storage system, etc. I would look at the benchmarking info
available on the lists as a good starting point.


On Fri, Jul 23, 2010 at 11:51 AM, Michael Widmann
<michael.widmann@gmail.com> wrote:
> Hi
>
> We plan to use cassandra as a data storage on at least 2 nodes with RF=2
> for about 1 billion small files.
> We have about 48TB of disk space behind each node.
>
> Now my question is - is this possible with cassandra, reliably - meaning
> every blob is stored on 2 jbods?
>
> We may grow to nearly 40TB or more of cassandra "storage" data ...
>
> Has anyone done something similar?
>
> For retrieval of the blobs we are going to index them with a hash value
> (meaning hashes are used to store the blob) ...
> so we can search fast for the entry in the database and combine the blobs
> into a normal file again ...
>
> thanks for any answers
>
> michael
>
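[A sketch of the blob scheme described above: split a file into 64KB
chunks, address each chunk by its content hash, and rebuild the file from
the ordered hash list. The store is stubbed as a dict here; in practice it
would be the blob CF.]

    import hashlib

    BLOB_SIZE = 64 * 1024

    def split_file(path, store):
        # store: dict-like mapping blob hash -> bytes.
        # Returns the ordered hash manifest needed to rebuild the file.
        manifest = []
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(BLOB_SIZE)
                if not chunk:
                    break
                h = hashlib.sha1(chunk).hexdigest()
                store[h] = chunk   # content-addressed: equal chunks dedupe
                manifest.append(h)
        return manifest

    def rebuild_file(manifest, store, out_path):
        # Concatenate the chunks back in manifest order.
        with open(out_path, 'wb') as out:
            for h in manifest:
                out.write(store[h])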



--
bayoda.com - Professional Online Backup Solutions for Small and Medium
Sized Companies


--
bayoda.com - Professional Online Backup Solutions for Small and Medium
Sized Companies