incubator-cassandra-user mailing list archives

From Nathan McCall <n...@vervewireless.com>
Subject Re: Range scan performance in 0.6.0 beta2
Date Thu, 25 Mar 2010 16:40:23 GMT
I noticed you turned key caching off in your ColumnFamily declaration.
Have you tried turning it on and experimenting with the key caching
configuration? Also, have you looked at the JMX output to see which
commands are pending execution? That is always helpful to me in
hunting down bottlenecks.

-Nate

On Thu, Mar 25, 2010 at 9:31 AM, Henrik Schröder <skrolle@gmail.com> wrote:
> On Thu, Mar 25, 2010 at 15:17, Sylvain Lebresne <sylvain@yakaz.com> wrote:
>>
>> I don't know if this plays any role, but if you have disabled
>> assertions when running Cassandra (that is, you removed the -ea line in
>> cassandra.in.sh), there was a bug in 0.6beta2 that makes reads in rows
>> with lots of columns quite slow.
>
> We tried it with beta3 and got the same results, so that didn't do anything.
>
>>
>> Another problem you may have is if the commitLog directory is on the
>> same hard drive as the data directory. If that's the case and you read
>> and write at the same time, that may be a reason for poor read
>> performance (and write performance too).
>
> We also tested doing only reads, and got about the same read speeds.
>
>>
>> As for the row with 30 million columns, you have to be aware that right
>> now, Cassandra will deserialize whole rows during compaction
>> (http://wiki.apache.org/cassandra/CassandraLimitations).
>> So depending on the size of what you store in your columns, you could
>> very well hit that limitation (that could be why you OOM). In which case,
>> I see two choices: 1) add more RAM to the machine or 2) change your data
>> structure to avoid that (maybe you can split rows with too many columns
>> somehow?).
>
> Splitting the rows would be an option if we got anything near decent speed
> for small rows, but even if we only have a few hundred thousand columns in
> one row, the read speed is still slow.
>
> What kind of numbers are common for this type of operation? Say that you
> have a row with 500000 columns whose names range from 0x0 to 0x7A120, and
> you do get_slice operations on it over random ranges in that interval but
> with a fixed count of 1000, multithreaded across ~10 threads. Can't you
> get more than 50 reads/s?
>
> When we've been reading up on Cassandra we've seen posts saying that
> billions of columns in a row shouldn't be a problem, and sure enough,
> writing all that data goes pretty fast, but as soon as you want to retrieve
> it, it is really slow. We also tried counting the columns in a row, and
> that was really, really slow: it took half a minute to count the columns in
> a row with 500000 columns, and doing the same on a row with millions of
> columns just crashed with an OOM exception after a few minutes.
>
>
> /Henrik
>
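
For anyone wanting to reproduce the read pattern Henrik describes (random
start points in a 500000-column row, a fixed count of 1000 per slice, about
10 threads), a minimal harness might look like the sketch below. It is not
code from this thread: fake_get_slice is a hypothetical in-memory stand-in,
and run_benchmark and its parameters are names invented for illustration.
To measure a real cluster you would replace the stand-in with a function
that issues get_slice through whatever client library you use.

```python
import random
import threading
import time

ROW_COLUMNS = 500_000   # column names run from 0x0 up to about 0x7A120
SLICE_COUNT = 1000      # fixed count per slice, as in the message
NUM_THREADS = 10        # "~10 threads"


def fake_get_slice(start, count):
    """Hypothetical stand-in for a get_slice call (no Cassandra involved).

    Returns the column names from `start` onwards, capped at `count`.
    """
    return list(range(start, min(start + count, ROW_COLUMNS)))


def run_benchmark(fetch, reads_per_thread=100, num_threads=NUM_THREADS, seed=0):
    """Call fetch(start, SLICE_COUNT) at random start points from several threads."""
    latencies = []
    lock = threading.Lock()

    def worker(worker_id):
        rng = random.Random(seed + worker_id)  # per-thread generator
        local = []
        for _ in range(reads_per_thread):
            start = rng.randrange(0, ROW_COLUMNS - SLICE_COUNT)
            t0 = time.perf_counter()
            fetch(start, SLICE_COUNT)
            local.append(time.perf_counter() - t0)
        with lock:
            latencies.extend(local)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    wall_start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - wall_start

    total = len(latencies)
    return {
        "reads": total,
        "reads_per_sec": total / elapsed if elapsed > 0 else float("inf"),
        "mean_latency_s": sum(latencies) / total,
    }


if __name__ == "__main__":
    print(run_benchmark(fake_get_slice))
```

With the in-memory stand-in the numbers only measure the harness itself;
they become meaningful once fetch talks to a real node.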
