incubator-cassandra-user mailing list archives

From Sylvain Lebresne <sylv...@yakaz.com>
Subject Re: Bad read performances: 'few rows of many columns' vs 'many rows of few columns'
Date Tue, 09 Mar 2010 19:14:03 GMT
Alright,

What I'm observing shows better with bigger columns, so I've slightly modified
the stress.py test so that it inserts columns of 50K bytes (I attach the
modified stress.py for reference, but it really just reads 50000 bytes from
/dev/null and uses that as data. I also added a sleep to the inserts,
otherwise Cassandra dies during the insertion :)).
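
For reference, a minimal sketch of that kind of modification (the names are
made up, and since a read from /dev/null actually returns zero bytes,
/dev/zero stands in below to get a real 50K payload):

    import time

    COLUMN_SIZE = 50000

    # 50K bytes used as the value of every column. Note: reading
    # /dev/null returns empty bytes, so /dev/zero stands in here.
    f = open('/dev/zero', 'rb')
    DATA = f.read(COLUMN_SIZE)
    f.close()

    def insert_row(insert_one, key, column_names):
        # insert_one stands in for the Thrift insert call in stress.py.
        for name in column_names:
            insert_one(key, name, DATA)
        # Throttle the load so the node survives the bulk insert
        # (the exact sleep duration is a guess).
        time.sleep(0.05)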

I'm also using 0.6-beta2 from the Cassandra website, and I've given 1.5G of
RAM to Cassandra just in case.

I've inserted 1000 rows of 100 columns each (python stress.py -t 2 -n
1000 -c 100 -i 5).
If I read, I get roughly the same number of rows per second whether I read
the whole row (python stress.py -t 10 -n 1000 -o read -r -c 100) or only
the first column (python stress.py -t 10 -n 1000 -o read -r -c 1). And
that's less than 10 rows per second.

So sure, when I read the whole row, that's almost 1000 columns per second,
which is roughly 50MB/s of throughput, which is quite good. But when I read
only the first column, I get 10 columns per second, that is 500KB/s, which
is less good. Now, from what I've understood so far, Cassandra doesn't
deserialize the whole row to read a single column (I'm not using
supercolumns here), so I don't understand those numbers.
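
To make the arithmetic explicit:

    # Back-of-the-envelope numbers quoted above.
    column_size = 50000                      # bytes per column value

    # Whole-row reads: ~10 rows/s, 100 columns of 50K each.
    whole_row_bps = 10 * 100 * column_size   # 50000000 B/s, ~50MB/s

    # First-column-only reads: same ~10 rows/s, but 1 column each.
    single_col_bps = 10 * 1 * column_size    # 500000 B/s, ~500KB/s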

Plus, if I insert the same data but 'inlining' everything, that is 100000
rows of 1 column, then I get read performances of around 400 columns per
second. Does that mean that I should put columns in the same row only if
every request will read at least 40 columns at a time?
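
The 40 figure comes from comparing the two layouts:

    # Where the "at least 40 columns" break-even comes from.
    inlined_cols_per_sec = 400    # 100000 rows of 1 column each
    wide_rows_per_sec = 10        # 1000 rows of 100 columns each

    # A wide-row read only pays off once each request fetches
    # at least this many columns:
    break_even = inlined_cols_per_sec / wide_rows_per_sec   # = 40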

Just to explain why I'm doing such a test, let me quickly explain what I'm
trying to do. I need to store images that are geographically localized.
When I request them, I request 5 to 10 of those images that are
geographically close. My idea is to have row keys that are some id of a
delimited region and column names that are the actual geographic position
of the image (the column values are the image data). Each region (row) will
have from 10 to around 10000 images (columns) max, and getting my 5-10
geographically close images just amounts to a get_slice (sketched below).
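
Roughly, the read would look like this, assuming the 0.6 Thrift Python
bindings (the keyspace, column family and position encoding here are made
up for the example):

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra
    from cassandra.ttypes import (ColumnParent, SlicePredicate,
                                  SliceRange, ConsistencyLevel)

    socket = TSocket.TSocket('localhost', 9160)
    transport = TTransport.TBufferedTransport(socket)
    client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()

    # One row per delimited region; column names sort by the encoded
    # geographic position, so nearby images are adjacent columns.
    region_key = 'region-48.85-2.35'
    parent = ColumnParent(column_family='Images')
    predicate = SlicePredicate(slice_range=SliceRange(
        start='48.8566,2.3522',   # encoded position to start from
        finish='',                # open-ended
        reversed=False,
        count=10))                # the 5-10 images wanted

    result = client.get_slice('Keyspace1', region_key, parent,
                              predicate, ConsistencyLevel.ONE)
    images = [cosc.column.value for cosc in result]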
But when I do that, I have bad read performances (4-5 rows/sec, that is 50
images max per second and less than that on average). I get better
performances by putting one image per row. And that makes me really sad, as
it makes me use Cassandra as a basic key/value store without using the free
sorting. And I want my free sorting :(

Thanks in advance for any explanation/help.

Cheers,
Sylvain

On Tue, Mar 9, 2010 at 3:34 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
> On Tue, Mar 9, 2010 at 8:31 AM, Sylvain Lebresne <sylvain@yakaz.com> wrote:
>> Well, unless I'm mistaken, that's the same in my example, as in both
>> cases I give stress.py the option '-c 1', which tells it to retrieve
>> only one column each time, even in the case where I have 100 columns
>> per row.
>
> Oh.
>
> Why would you do that? :)
>
