cassandra-user mailing list archives

From: daemeon reiydelle <daeme...@gmail.com>
Subject: Re: Cassandra as a key/object store for many small (10-60k) files
Date: Fri, 05 May 2017 19:25:22 GMT
These numbers do not match, e.g., AWS, so I'm guessing you are using local
storage?
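
For a sense of scale, the figures in the quoted message below can be turned into a rough batch-read time. This is only a back-of-the-envelope sketch: it assumes decimal units (1 kB = 1000 bytes) and treats the reported 2 MB/sec as a sustained aggregate rate, which the original message does not actually specify. The function name is illustrative.

```python
# Back-of-the-envelope estimate of how long the first batch read would take.
# Assumptions (not stated in the thread): decimal units, and 2 MB/sec is a
# sustained aggregate rate for the whole read.
KB = 1_000
MB = 1_000_000


def batch_read_hours(objects: int, object_bytes: int, read_bytes_per_sec: int) -> float:
    """Hours needed to read `objects` blobs of `object_bytes` each."""
    total_bytes = objects * object_bytes
    return total_bytes / read_bytes_per_sec / 3600


# Low end of the quoted range: 5 million objects of 10 kB in a day.
best_case = batch_read_hours(5_000_000, 10 * KB, 2 * MB)
# High end of the quoted range: 20 million objects of 60 kB in a day.
worst_case = batch_read_hours(20_000_000, 60 * KB, 2 * MB)

print(f"best case:  {best_case:.1f} hours")
print(f"worst case: {worst_case:.1f} hours")
```

Under these assumptions the low end is on the order of seven hours and the high end is close to a week, which is why the measured random-read rate matters so much for the daily batch.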


*.......*

*Making a billion dollar startup is easy: "take a human desire, preferably
one that has been around for a really long time … Identify that desire and
use modern technology to take out steps."*


*.......Daemeon C.M. Reiydelle, USA (+1) 415.501.0198, London (+44) (0) 20 8144
9872*

On Fri, May 5, 2017 at 12:19 PM, Jonathan Guberman <jg@tineye.com> wrote:

> Hello,
>
> We’re currently testing Cassandra for use as a pure key-object store for
> data blobs around 10kB - 60kB each. Our use case is storing on the order of
> 10 billion objects with about 5-20 million new writes per day. A written
> object will never be updated or deleted. Objects will be read at least
> once, some time within 10 days of being written. This will generally happen
> as a batch; that is, all of the images written on a particular day will be
> read together at the same time. This batch read will only happen one time;
> future reads will happen on individual objects, with no grouping, and they
> will follow a long-tail distribution, with popular objects read thousands
> of times per year but most read never or virtually never.
>
> I’ve set up a small four node test cluster and have written test scripts
> to benchmark writing and reading our data. The table I’ve set up is very
> simple: an ascii primary key column with the object ID and a blob column
> for the data. All other settings were left at their defaults.
>
> I’ve found write speeds to be very fast most of the time. However,
> periodically, writes will slow to a crawl for anywhere from half an hour to
> two hours, after which speeds recover to their previous levels. I assume
> this is some sort of data compaction or flushing to disk, but I haven’t
> been able to figure out the exact cause.
>
> Read speeds have been more disappointing. Cached reads are very fast, but
> random read speed averages about 2 MB/sec, which is too slow when we need
> to read out a batch of several million objects. I don’t think it’s
> reasonable to assume that these rows will all still be cached by the time
> we need to read them for that first large batch read.
>
> My general question is whether anyone has any suggestions for how to
> improve performance for our use case. More specifically:
>
> - Is there a way to mitigate or eliminate the huge slowdowns I see when
> writing millions of rows?
> - Are there settings I should be using in order to maximize read speeds
> for random reads?
> - Is there a way to design our tables to improve the read speeds for the
> initial large batched reads? I was thinking of using a batch ID column that
> could be used to retrieve the data for the initial block. However, future
> reads would need to be done by the object ID, not the batch ID, so it seems
> to me I’d need to duplicate the data, once in an “objects by batch” table
> and once in a simple “objects” table. Is there a better approach than
> this?
>
> Thank you!
>
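
The two-table idea from the last question above could be sketched roughly as follows. This is a sketch under stated assumptions, not a recommendation from the thread: the table names, the column names, the `NUM_BUCKETS` value, and the `batch_partition` helper are all hypothetical. The bucket column is there because a single day of writes (5-20 million objects of 10-60 kB, so roughly 50 GB to 1.2 TB) would be far too large for one partition; splitting each day into hashed buckets keeps partitions at tens of megabytes with the bucket count assumed here.

```python
import zlib
from datetime import datetime, timezone
from typing import Tuple

# Hypothetical bucket count: 1.2 TB / 16384 is roughly 73 MB per partition.
NUM_BUCKETS = 16384

# Lookup table for the long-tail reads by object ID (as described in the message).
OBJECTS_DDL = """
CREATE TABLE IF NOT EXISTS objects (
    object_id ascii PRIMARY KEY,
    data blob
)
"""

# Hypothetical second table for the one-time daily batch read.
# The partition key is (day, bucket), so each partition stays bounded in size.
OBJECTS_BY_BATCH_DDL = """
CREATE TABLE IF NOT EXISTS objects_by_batch (
    batch_day date,
    bucket int,
    object_id ascii,
    data blob,
    PRIMARY KEY ((batch_day, bucket), object_id)
)
"""


def batch_partition(object_id: str, written_at: datetime) -> Tuple[str, int]:
    """Return the (batch_day, bucket) partition key for one object.

    The day is taken in UTC; the bucket is a stable CRC32 hash of the ID,
    so the same object always maps to the same partition.
    """
    day = written_at.astimezone(timezone.utc).date().isoformat()
    bucket = zlib.crc32(object_id.encode("ascii")) % NUM_BUCKETS
    return day, bucket


day, bucket = batch_partition(
    "example-object-id", datetime(2017, 5, 5, 19, 25, 22, tzinfo=timezone.utc)
)
print(day, bucket)
```

With a layout like this, the daily batch read would iterate over the buckets for one day and read each partition sequentially, while later reads go to the plain `objects` table by ID. The cost is that every blob is written twice, which is the duplication the question already anticipates.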
