incubator-cassandra-user mailing list archives

From Peter Schuller <peter.schul...@infidyne.com>
Subject Re: Storing photos, images, docs etc.
Date Wed, 02 Mar 2011 08:18:18 GMT
> Is it advisable or ok to store photos, images and docs in cassandra where you
> expect high volume of uploads and views?

To diverge a bit from the direction the thread is going: You can
definitely store large files in Cassandra. I would recommend against
doing so by simply smacking entire files into column values, because
the architecture assumes that columns are reasonably sized (lots of
them fit in memory, lots of temporary columns are okay to create,
etc.).

Off the top of my head my starting point would be using one row per
file and splitting the actual content up into columns. For dealing
with larger files you may wish to consider splitting into multiple
rows so that even individual files can get replicated across a cluster
(avoids single very large files causing out-of-disk or performance
problems on an individual node, and allows an individual file to enjoy
scaling out for performance).
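As a rough illustration of the scheme above, here is a minimal sketch (hypothetical names and sizes, not an official Cassandra API) of splitting a blob into fixed-size chunks and grouping the chunks into rows, so that a large file spreads across multiple rows and hence multiple nodes:

```python
# Hypothetical chunking scheme: one column per chunk, a bounded number
# of chunks per row so no single row (and thus node) holds a huge blob.

CHUNK_SIZE = 256 * 1024        # bytes per column value (assumed size)
CHUNKS_PER_ROW = 64            # caps each row at ~16 MB

def chunk_layout(file_id: str, data: bytes):
    """Yield (row_key, column_name, chunk_bytes) triples for a blob."""
    for i in range(0, len(data), CHUNK_SIZE):
        chunk_index = i // CHUNK_SIZE
        # Every CHUNKS_PER_ROW chunks start a new row, so replication
        # and load spread out even for a single large file.
        row_key = f"{file_id}:{chunk_index // CHUNKS_PER_ROW}"
        column_name = chunk_index % CHUNKS_PER_ROW
        yield row_key, column_name, data[i:i + CHUNK_SIZE]
```

With these numbers, a 20 MB file lands in two rows rather than one oversized row on a single node.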

However, all that is just deciding how to represent the data in
Cassandra appropriately for the use case. I think the bigger, more
practical issue is what you're looking for in terms of efficiency. I
wouldn't necessarily call Cassandra the most efficient way to store
large blobs, because compaction will be a lot more expensive in
relative terms than when it is used for small individual items of
data. On the other hand, Cassandra should shine in giving you
reasonably efficient random access to subranges of files, while
allowing you to easily write file data in a non-coordinated fashion
(concurrency across subranges). There are non-trivial trade-offs.
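To make the subrange-access point concrete, here is a sketch (assuming a fixed-chunk layout with columns named by chunk index; the chunk size is an illustrative assumption) of translating a byte range into the column slice to fetch:

```python
# Hypothetical helper: map a byte range onto chunk indices so a reader
# can fetch just a column slice instead of the whole file.

CHUNK_SIZE = 256 * 1024  # assumed bytes per chunk/column

def chunks_for_range(offset: int, length: int):
    """Return (first_chunk, last_chunk, skip_in_first): the inclusive
    chunk-index slice covering the range, and how many bytes of the
    first chunk precede the requested offset."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return first, last, offset % CHUNK_SIZE
```

A reader would issue a column slice query for columns `first`..`last` on the relevant row(s) and discard the leading `skip_in_first` bytes.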

If you were to store, say, predominantly 5-50 MB files and you had no
desire beyond just storing them as single large blobs, a local storage
model that implied one file per blob would be much more efficient,
assuming each individual blob could be streamed to the client.

Bottom line, I think the two primary potential concerns would be: Are
you looking at a *lot* of writes? Write overhead in terms of
throughput and disk I/O will be larger than for a typical database
workload of small "things" (regardless of row/column/supercolumn
division) being written. The other is that if compaction becomes
I/O bound rather than CPU bound, you may have bigger issues with read
latency than otherwise.

Regardless, I don't think focusing on whether or not it's a good idea
to have a huge single column is the right approach to the problem
since that's more about using the Cassandra data model appropriately.

-- 
/ Peter Schuller
