incubator-cassandra-user mailing list archives

From Henrik Schröder <skro...@gmail.com>
Subject Re: Range scan performance in 0.6.0 beta2
Date Thu, 25 Mar 2010 16:31:49 GMT
On Thu, Mar 25, 2010 at 15:17, Sylvain Lebresne <sylvain@yakaz.com> wrote:

> I don't know if that could play any role, but if you have disabled
> assertions when running Cassandra (that is, you removed the -ea line in
> cassandra.in.sh), there was a bug in 0.6beta2 that makes reads in rows
> with lots of columns quite slow.
>

We tried it with beta3 and got the same results, so that didn't do anything.


> Another problem you may have is if the commitLog directory is on the
> same hard drive as the data directory. If that's the case and you read
> and write at the same time, that may be a reason for poor read
> performance (and write too).
>

We also tested doing only reads, and got about the same read speeds.


> As for the row with 30 million columns, you have to be aware that right
> now, Cassandra will deserialize whole rows during compaction
> (http://wiki.apache.org/cassandra/CassandraLimitations).
> So depending on the size of what you store in your columns, you could
> very well hit that limitation (that could be why you OOM). In which case,
> I see two choices: 1) add more RAM to the machine or 2) change your data
> structure to avoid that (maybe you can split rows with too many columns
> somehow?).
>

Splitting the rows would be an option if we got anything near decent speed
for small rows, but even if we only have a few hundred thousand columns in
one row, the read speed is still slow.
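
For reference, the row splitting suggested above could look roughly like the sketch below. It is not tied to any Cassandra client API: the bucket size, the "base:bucket" row key format and both function names are assumptions made up for illustration. It only shows how a numeric column range would map onto several smaller rows.

```python
BUCKET_SIZE = 10_000  # hypothetical number of columns per physical row


def bucket_row_key(base_key, column):
    """Return the (hypothetical) row key that would hold this column."""
    return f"{base_key}:{column // BUCKET_SIZE}"


def split_slice(base_key, start, finish):
    """Split an inclusive column range into one sub-range per bucket row.

    Returns a list of (row_key, start, finish) tuples, in column order.
    """
    pieces = []
    for bucket in range(start // BUCKET_SIZE, finish // BUCKET_SIZE + 1):
        lo = max(start, bucket * BUCKET_SIZE)
        hi = min(finish, (bucket + 1) * BUCKET_SIZE - 1)
        pieces.append((f"{base_key}:{bucket}", lo, hi))
    return pieces
```

A slice that crosses a bucket boundary then becomes several smaller slices, one per row, which the client has to issue and concatenate itself.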

What kind of numbers are common for this type of operation? Say that you
have a row with 500000 columns whose names range from 0x0 to 0x7A120, and
you do get_slice operations on it with random ranges within that interval
and a fixed count of 1000, multithreaded across ~10 threads. Shouldn't you
be able to get more than 50 reads/s?
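
The shape of that test can be sketched as below. This is a harness only: get_slice_stub is a hypothetical in-memory stand-in (a sorted list searched with bisect), not a Cassandra client call, and all names and constants are assumptions chosen to match the numbers above. To measure a real cluster, the stub would have to be replaced with an actual client call.

```python
import bisect
import random
import time
from concurrent.futures import ThreadPoolExecutor

NUM_COLUMNS = 500_000  # column names 0x0 up to (but not including) 0x7A120
SLICE_COUNT = 1000     # fixed count per slice
NUM_THREADS = 10

# Stand-in for the wide row: a sorted list of column names.
ROW = list(range(NUM_COLUMNS))


def get_slice_stub(start, finish, count):
    """Hypothetical stand-in: up to `count` column names in [start, finish]."""
    lo = bisect.bisect_left(ROW, start)
    hi = bisect.bisect_right(ROW, finish)
    return ROW[lo:min(hi, lo + count)]


def worker(seed, reads):
    """Issue `reads` slices over random ranges; return the size of each."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(reads):
        a, b = rng.randrange(NUM_COLUMNS), rng.randrange(NUM_COLUMNS)
        sizes.append(len(get_slice_stub(min(a, b), max(a, b), SLICE_COUNT)))
    return sizes


def run_benchmark(reads_per_thread=100):
    """Run the workers in parallel; return (total reads, reads/s, sizes)."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        results = list(pool.map(worker, range(NUM_THREADS),
                                [reads_per_thread] * NUM_THREADS))
    elapsed = time.perf_counter() - started
    total = sum(len(r) for r in results)
    rate = total / elapsed if elapsed > 0 else float("inf")
    return total, rate, results
```

Calling run_benchmark() returns the total number of reads and the reads/s figure, which is the number being compared against 50 reads/s here.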

When we were reading up on Cassandra, we saw posts saying that billions of
columns in a row shouldn't be a problem. Sure enough, writing all that data
goes pretty fast, but retrieving it is really slow. We also tried counting
the columns in a row, and that was extremely slow: it took half a minute to
count the columns in a row with 500000 columns, and doing the same on a row
with millions of columns crashed with an OOM exception after a few minutes.


/Henrik
