To: user@cassandra.apache.org
From: Aaron Morton <aaron@thelastpickle.com>
Subject: Re: Cassandra to store 1 billion small 64KB Blobs
Date: Sun, 25 Jul 2010 22:00:21 -0700 (PDT)

Some background reading.. http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/

Not sure about your follow-up question, so I'll just wildly blather on about things :)

My assumption about your data is that you have 64KB chunks, each identified by a hash, which can somehow be grouped together into larger files (so there is a "file name" of sorts).

One possible storage design (assuming the Random Partitioner) is:

A Chunks CF: each row in this CF uses the hash of the chunk as its key and has a single column with the chunk data. You could use more columns to store metadata here.
A ChunkIndex CF: each row uses the file name (from above) as the key and has one column for each chunk in the file. The column name *could* be the offset of the chunk and the column value could be the hash of the chunk. Or you could use the chunk hash as the column name and the offset as the column value if needed.

To rebuild the file, read the entire row from the ChunkIndex, then make a series of multigets to read all the chunks. Or you could lazily populate only the ones you need.
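
To make that concrete, here is a rough sketch in Python. It assumes a pycassa-style client and illustrative names (a 'BlobStore' keyspace with 'Chunks' and 'ChunkIndex' CFs) that are not fixed by anything above; offsets are zero-padded so the index columns sort correctly.

    import hashlib
    import pycassa

    # Illustrative keyspace/CF names; adjust to match your storage-config.
    pool = pycassa.ConnectionPool('BlobStore', ['localhost:9160'])
    chunks = pycassa.ColumnFamily(pool, 'Chunks')
    chunk_index = pycassa.ColumnFamily(pool, 'ChunkIndex')

    def store_chunk(file_name, offset, data):
        # Chunks CF: row key = chunk hash, a single column holding the bytes.
        chunk_hash = hashlib.sha1(data).hexdigest()
        chunks.insert(chunk_hash, {'data': data})
        # ChunkIndex CF: row key = file name, column name = zero-padded offset,
        # column value = chunk hash.
        chunk_index.insert(file_name, {'%012d' % offset: chunk_hash})
        return chunk_hash

    def rebuild_file(file_name):
        # Read the whole index row, then multiget the chunk rows it points to.
        index_row = chunk_index.get(file_name, column_count=1000000)
        hashes = [index_row[offset] for offset in sorted(index_row)]
        chunk_rows = chunks.multiget(hashes)
        return ''.join(chunk_rows[h]['data'] for h in hashes)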

This is all assuming that the "1000s" comment below means you may want to combine the chunks into 60+ MB files. It would be easier to keep all the chunks for a file together in one row, but if you are going to have large (unbounded) file sizes this may not be appropriate.

You could also think about using the order preserving partitioner, and using a compound key for each row such as "file_name_hash.offset". Then, by using get_range_slices to scan the range of chunks for a file, you would not need to maintain a secondary index. There are some drawbacks to that approach; read the article above.
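
And a sketch of the compound-key alternative, again with illustrative names; it assumes the cluster runs the order preserving partitioner and that a zero-padded offset is appended to the file name hash in the row key (pycassa's get_range wraps the get_range_slices call mentioned above):

    import pycassa

    pool = pycassa.ConnectionPool('BlobStore', ['localhost:9160'])
    chunks = pycassa.ColumnFamily(pool, 'Chunks')

    def read_file_chunks(file_name_hash):
        # Row keys look like "<file_name_hash>.<offset>"; under the order
        # preserving partitioner a key-range scan returns them in order,
        # so all chunks of one file come back as one contiguous scan.
        start = file_name_hash + '.'
        finish = file_name_hash + '.~'   # '~' sorts after the offset digits
        for key, columns in chunks.get_range(start=start, finish=finish):
            yield key, columns['data']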

Hope that helps.
Aaron


On 26 Jul 2010, at 04:01 PM, Michael Widmann <michael.widmann@gmail.com> wrote:

Thanks for this detailed description ...

You mentioned the secondary index in a standard column; would it be better to build several indexes?
Is it even possible to build an index on, for example, 32 columns?

The hint about the smaller boxes is very valuable!

Mike
2010/7/26 Aaron Morton <aaron@thelastpickle.com>
For what it's worth...

* Many smaller boxes with local disk storage are preferable to 2 with huge NAS storage.
* To cache the hash values, look at the KeysCached setting in the storage-config.
* There are some row size limits; see http://wiki.apache.org/cassandra/CassandraLimitations
* If you wanted to get 1000 blobs, rather than grouping them in a single row using a super column, consider building a secondary index in a standard column family. One CF for the blobs using your hash, one CF that uses whatever the grouping key is, with a column for every blob's hash value. Read from the index first, then from the blobs themselves (see the sketch after this list).
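
A sketch of that index-first read, with illustrative CF names ('BlobIndex' keyed by the grouping key, 'Blobs' keyed by blob hash) and a pycassa-style client, neither of which is prescribed above; fetching the ~1000 blobs in modest batches keeps any single multiget from getting too large:

    import pycassa

    pool = pycassa.ConnectionPool('BlobStore', ['localhost:9160'])
    blob_index = pycassa.ColumnFamily(pool, 'BlobIndex')   # one row per grouping key
    blobs = pycassa.ColumnFamily(pool, 'Blobs')            # one row per blob hash

    def read_group(group_key, batch_size=100):
        # The index row holds one column per blob hash in this group.
        hashes = list(blob_index.get(group_key, column_count=1000000).keys())
        # Then read the blobs themselves, a batch at a time.
        for i in range(0, len(hashes), batch_size):
            for h, row in blobs.multiget(hashes[i:i + batch_size]).items():
                yield h, row['data']
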
Aaron



On 24 Jul 2010, at 06:51 PM, Michael Widmann <michael.widmann@gmail.com> wrote:
Hi Jonathan

Thanks for your very valuable input on this.

Maybe I didn't give enough explanation, so I'll try to clarify.

Here are some thoughts:

  • binary data will not be indexed - only stored.
  • The file name of the binary data (a hash) should be indexed for search
  • We could group the hashes into 62 "entry" points for search retrieval -> I think supercolumns (if I'm right on the terms) (a-z, A-Z, 0-9)
  • the 64KB blobs' metadata (which one belongs to which file) should be stored separately in cassandra
  • For hardware we rely on Solaris / OpenSolaris with ZFS in the backend
  • Write operations occur much more often than reads
  • Memory should hold the hash values mainly for fast search (not the binary data)
  • Read operations (restore from cassandra) may be async - get about 1000 blobs, group them, restore

So my question is also:
    2 or 3 big boxes, or 10 to 20 small boxes for storage?
    Could we separate "caching" - hash value CFs cached and indexed, binary data CFs not?
    Writes happen around the clock - not at tremendous speed, but constantly.
    Would compaction of the database need a lot of extra disk space?
    Is it reliable at this size? (That is more my fear.)

    thx for thinking and answers...

    greetings

    Mike

    2010/7/23 Jonathan Shook <jshook@gmail.com>
    There are two scaling factors to consider here. In general the worst-case growth of operations in Cassandra is kept near to O(log2(N)). Any worse growth would be considered a design problem, or at least a high-priority target for improvement. This is important for considering the load generated by very large column families, as binary search is used when the bloom filter doesn't exclude rows from a query. O(log2(N)) is basically the best achievable growth for this type of data, but the bloom filter improves on it in some cases by paying a lower cost every time.
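
    As a rough worked figure, a plain binary search over a billion row keys needs about log2(10^9) ≈ 30 comparisons per uncached lookup - that is the O(log2(N)) growth being described:

        import math
        # ~29.9, i.e. roughly 30 binary-search steps for 10**9 keys
        print(math.log(1000000000, 2))
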
    The other factor to be aware of is the reduction of binary search performance for datasets which can put disk seek times into high ranges. This is mostly a direct consideration for those installations which will be doing lots of cold reads (not cached data) against large sets. Disk seek times are much more limited (low) for adjacent or near tracks, and generally much higher when tracks are sufficiently far apart (as in a very large data set). This can compound with other factors when session times are longer, but that is to be expected with any system. Your storage system may have completely different characteristics depending on caching, etc.

    The read performance is still quite high relative to other systems for a similar data set size, but the drop-off in performance may be much worse than expected if you are wanting it to be linear. Again, this is not unique to Cassandra. It's just an important consideration when dealing with extremely large sets of data, when memory is not likely to be able to hold enough hot data for the specific application.

    As always, the real questions have lots more to do with your specific access patterns, storage system, etc. I would look at the benchmarking info available on the lists as a good starting point.


    On Fri, Jul 23, 2010 at 11:51 AM, Michael Widmann <michael.widmann@gmail.com> wrote:
    > Hi
    >
    > We plan to use cassandra as a data store on at least 2 nodes with RF=2
    > for about 1 billion small files.
    > We have about 48TB of disk space behind each node.
    >
    > Now my question is - is this possible with cassandra, reliably - meaning
    > (every blob is stored on 2 JBODs).
    >
    > We may grow up to nearly 40TB or more of cassandra "storage" data ...
    >
    > Has anyone out there done something similar?
    >
    > For retrieval of the blobs we are going to index them with a hash value
    > (meaning hashes are used to store the blob) ...
    > so we can search fast for the entry in the database and combine the blobs
    > back into a normal file again ...
    >
    > Thanks for your answers
    >
    > michael
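
    A back-of-envelope check on those figures, assuming every blob really is a full 64KB:

        blob_count = 1000 * 1000 * 1000
        blob_size = 64 * 1024                    # bytes per blob, upper bound
        replication_factor = 2
        raw = blob_count * blob_size             # ~65.5 TB of unique data
        total = raw * replication_factor         # ~131 TB stored across the cluster
        per_node = total / 2                     # ~65.5 TB each on a 2-node cluster
        print(raw, total, per_node)

    On two 48TB nodes that would already be tight before leaving any headroom for compaction, which also bears on the "2 or 3 big boxes or 10 to 20 small boxes" question above.
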

    --
    bayoda.com - Professional Online Backup Solutions for Small and Medium Sized Companies



--
bayoda.com - Professional Online Backup Solutions for Small and Medium Sized Companies