To: user@cassandra.apache.org
From: Aaron Morton <aaron@thelastpickle.com>
Subject: Re: Cassandra to store 1 billion small 64KB Blobs
Date: Sun, 25 Jul 2010 22:00:21 -0700 (PDT)

Some background reading.. http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/

Not sure about your follow-up question, so I'll just wildly blather on about things :)

My assumption about your data is that you have 64KB chunks, each identified by a hash, which can somehow be grouped together into larger files (so there is a "file name" of sorts).

One possible storage design (assuming the Random Partitioner) is:

A Chunks CF: each row in this CF uses the hash of the chunk as its key and has a single column with the chunk data. You could use more columns to store metadata here.
A ChunkIndex CF: each row uses the file name (from above) as the key and has one column for each chunk in the file. The column name *could* be the offset of the chunk and the column value could be the hash of the chunk. Or you could use the chunk hash as the column name and the offset as the column value if needed.

To rebuild the file, read the entire row from the ChunkIndex, then make a series of multigets to read all the chunks. Or you could lazily populate only the ones you need.
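
To make that concrete, here is a rough sketch in Python. It assumes a pycassa-style client and illustrative names (a 'BlobStore' keyspace with 'Chunks' and 'ChunkIndex' CFs) that are not fixed by anything above; offsets are zero-padded so the index columns sort correctly.

    import hashlib
    import pycassa

    # Illustrative keyspace/CF names; adjust to match your storage-config.
    pool = pycassa.ConnectionPool('BlobStore', ['localhost:9160'])
    chunks = pycassa.ColumnFamily(pool, 'Chunks')
    chunk_index = pycassa.ColumnFamily(pool, 'ChunkIndex')

    def store_chunk(file_name, offset, data):
        # Chunks CF: row key = chunk hash, a single column holding the bytes.
        chunk_hash = hashlib.sha1(data).hexdigest()
        chunks.insert(chunk_hash, {'data': data})
        # ChunkIndex CF: row key = file name, column name = zero-padded offset,
        # column value = chunk hash.
        chunk_index.insert(file_name, {'%012d' % offset: chunk_hash})
        return chunk_hash

    def rebuild_file(file_name):
        # Read the whole index row, then multiget the chunk rows it points to.
        index_row = chunk_index.get(file_name, column_count=1000000)
        hashes = [index_row[offset] for offset in sorted(index_row)]
        chunk_rows = chunks.multiget(hashes)
        return ''.join(chunk_rows[h]['data'] for h in hashes)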

This is all assuming that the "1000s" comment below means you may want to combine the chunks into 60+ MB files. It would be easier to keep all the chunks for a file together in one row, but if you are going to have large (unbounded) file sizes this may not be appropriate.

You could also think about using the order preserving partitioner, and using a compound key for each row such as "file_name_hash.offset". Then, by using get_range_slices to scan the range of chunks for a file, you would not need to maintain a secondary index. There are some drawbacks to that approach; read the article above.
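
And a sketch of the compound-key alternative, again with illustrative names; it assumes the cluster runs the order preserving partitioner and that a zero-padded offset is appended to the file name hash in the row key (pycassa's get_range wraps the get_range_slices call mentioned above):

    import pycassa

    pool = pycassa.ConnectionPool('BlobStore', ['localhost:9160'])
    chunks = pycassa.ColumnFamily(pool, 'Chunks')

    def read_file_chunks(file_name_hash):
        # Row keys look like "<file_name_hash>.<offset>"; under the order
        # preserving partitioner a key-range scan returns them in order,
        # so all chunks of one file come back as one contiguous scan.
        start = file_name_hash + '.'
        finish = file_name_hash + '.~'   # '~' sorts after the offset digits
        for key, columns in chunks.get_range(start=start, finish=finish):
            yield key, columns['data']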

Hope that helps.
Aaron


On 26 Jul 2010, at 04:01 PM, Michael Widmann <michael.widmann@gmail.com> wrote:

Thanks for this detailed description ...

You mentioned the secondary index in a standard column; would it be better to build several indexes?
Is it even possible to build an index on, for example, 32 columns?

The hint about the smaller boxes is very valuable!

Mike
2010/7/26 Aaron Morton <aaron@thelastpickle.com>
For what it's worth...

* Many smaller boxes with local disk storage are preferable to 2 with huge NAS storage.
* To cache the hash values, look at the KeysCached setting in the storage-config.
* There are some row size limits; see http://wiki.apache.org/cassandra/CassandraLimitations
* If you wanted to get 1000 blobs, rather than grouping them in a single row using a super column, consider building a secondary index in a standard column family. One CF for the blobs using your hash, one CF that uses whatever the grouping key is, with a column for every blob's hash value. Read from the index first, then from the blobs themselves (see the sketch after this list).
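
A sketch of that index-first read, with illustrative CF names ('BlobIndex' keyed by the grouping key, 'Blobs' keyed by blob hash) and a pycassa-style client, neither of which is prescribed above; fetching the ~1000 blobs in modest batches keeps any single multiget from getting too large:

    import pycassa

    pool = pycassa.ConnectionPool('BlobStore', ['localhost:9160'])
    blob_index = pycassa.ColumnFamily(pool, 'BlobIndex')   # one row per grouping key
    blobs = pycassa.ColumnFamily(pool, 'Blobs')            # one row per blob hash

    def read_group(group_key, batch_size=100):
        # The index row holds one column per blob hash in this group.
        hashes = list(blob_index.get(group_key, column_count=1000000).keys())
        # Then read the blobs themselves, a batch at a time.
        for i in range(0, len(hashes), batch_size):
            for h, row in blobs.multiget(hashes[i:i + batch_size]).items():
                yield h, row['data']
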
Aaron



On 24 Jul 2010, at 06:51 PM, Michael Widmann <michael.widmann@gmail.com> wrote:
Hi Jonathan

Thanks for your very valuable input on this.

Maybe I didn't give enough explanation, so I'll try to clarify.

Here are some thoughts:

  • binary data will not be indexed - only stored.
  • The file name of the binary data (a hash) should be indexed for search
  • We could group the hashes into 62 "entry" points for search retrieval -> I think supercolumns (if I'm right on the terms) (a-z, A-Z, 0-9)
  • the 64KB blobs' metadata (which one belongs to which file) should be stored separately in cassandra
  • For hardware we rely on Solaris / OpenSolaris with ZFS in the backend
  • Write operations occur much more often than reads
  • Memory should hold the hash values mainly for fast search (not the binary data)
  • Read operations (restore from cassandra) may be async - get about 1000 blobs, group them, restore

So my question is also:
    2 or 3 big boxes, or 10 to 20 small boxes for storage?
    Could we separate "caching" - hash value CFs cached and indexed, binary data CFs not?
    Writes happen around the clock - not at tremendous speed, but constantly.
    Would compaction of the database need a lot of extra disk space?
    Is it reliable at this size? (That is more my fear.)

    thx for thinking and answers...

    greetings

    Mike

    2010/7/23 Jonathan Shook <jshook@gmail.com>
    There are two scaling factors to consider here. In general the worst-case growth of operations in Cassandra is kept near to O(log2(N)). Any worse growth would be considered a design problem, or at least a high-priority target for improvement. This is important for considering the load generated by very large column families, as binary search is used when the bloom filter doesn't exclude rows from a query. O(log2(N)) is basically the best achievable growth for this type of data, but the bloom filter improves on it in some cases by paying a lower cost every time.
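
    As a rough worked figure, a plain binary search over a billion row keys needs about log2(10^9) ≈ 30 comparisons per uncached lookup - that is the O(log2(N)) growth being described:

        import math
        # ~29.9, i.e. roughly 30 binary-search steps for 10**9 keys
        print(math.log(1000000000, 2))
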
    The other factor to be aware of is the reduction of binary search performance for datasets which can put disk seek times into high ranges. This is mostly a direct consideration for those installations which will be doing lots of cold reads (not cached data) against large sets. Disk seek times are much more limited (low) for adjacent or near tracks, and generally much higher when tracks are sufficiently far apart (as in a very large data set). This can compound with other factors when session times are longer, but that is to be expected with any system. Your storage system may have completely different characteristics depending on caching, etc.

    The read performance is still quite high relative to other systems for a similar data set size, but the drop-off in performance may be much worse than expected if you are wanting it to be linear. Again, this is not unique to Cassandra. It's just an important consideration when dealing with extremely large sets of data, when memory is not likely to be able to hold enough hot data for the specific application.

    As always, the real questions have lots more to do with your specific access patterns, storage system, etc. I would look at the benchmarking info available on the lists as a good starting point.


    On Fri, Jul 23, 2010 at 11:51 AM, Michael Widmann <michael.widmann@gmail.com> wrote:
    > Hi
    >
    > We plan to use cassandra as a data store on at least 2 nodes with RF=2
    > for about 1 billion small files.
    > We have about 48TB of disk space behind each node.
    >
    > Now my question is - is this possible with cassandra, reliably - meaning
    > (every blob is stored on 2 JBODs).
    >
    > We may grow up to nearly 40TB or more of cassandra "storage" data ...
    >
    > Has anyone out there done something similar?
    >
    > For retrieval of the blobs we are going to index them with a hash value
    > (meaning hashes are used to store the blob) ...
    > so we can search fast for the entry in the database and combine the blobs
    > back into a normal file again ...
    >
    > Thanks for your answers
    >
    > michael
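
    A back-of-envelope check on those figures, assuming every blob really is a full 64KB:

        blob_count = 1000 * 1000 * 1000
        blob_size = 64 * 1024                    # bytes per blob, upper bound
        replication_factor = 2
        raw = blob_count * blob_size             # ~65.5 TB of unique data
        total = raw * replication_factor         # ~131 TB stored across the cluster
        per_node = total / 2                     # ~65.5 TB each on a 2-node cluster
        print(raw, total, per_node)

    On two 48TB nodes that would already be tight before leaving any headroom for compaction, which also bears on the "2 or 3 big boxes or 10 to 20 small boxes" question above.
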

    --
    bayoda.com - Professional Online Backup Solutions for Small and Medium Sized Companies



--
bayoda.com - Professional Online Backup Solutions for Small and Medium Sized Companies