incubator-cassandra-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Wide rows or tons of rows?
Date Mon, 11 Oct 2010 15:08:50 GMT
2010/10/11 Héctor Izquierdo Seliva <izquierdo@strands.com>:
> Hi everyone.
>
> I'm sure this question or similar has come up before, but I can't find a
> clear answer. I have to store an unknown number of items in Cassandra,
> which can vary from a few hundred to a few million per customer.
>
> I read that in cassandra wide rows are better than a lot of rows, but
> then I face two problems. First, column distribution. The only way I can
> think of distributing items among a given set of rows is hashing the
> item id to a row id, and then using the item id as the column name. In
> this way, I can distribute data among a few rows evenly, but if there
> are only a few items it's equivalent to a row per item plus more
> overhead, and if there are millions of items then the rows are too big,
> and I have to turn off row cache. Does anybody know a way around this?
>
> The second issue is that in my benchmarks, once the data is mmapped, one
> item per row performs faster than wide rows by a significant margin. Is
> this how it is supposed to be?
>
> I can give additional data if needed. English is not my first language
> so I apologize beforehand if some of this doesn't make sense.
>
> Thanks for your time
>
>
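For reference, the bucketing scheme described in the quoted message (hash
the item id to pick a row, then use the item id as the column name) could
be sketched roughly as below. This is only an illustration: the function
names, the "customer:bucket" row key format, and the choice of MD5 are
assumptions, not anything taken from the original poster's code or from
Cassandra itself.

```python
import hashlib


def row_key_for(customer_id: str, item_id: str, num_buckets: int) -> str:
    """Map an item to one of num_buckets rows for a customer (hypothetical scheme)."""
    # Use a stable hash so the same item always lands in the same row.
    digest = hashlib.md5(item_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_buckets
    return f"{customer_id}:{bucket}"


def place_items(customer_id: str, item_ids, num_buckets: int) -> dict:
    """Group item ids (the column names) under their bucketed row keys."""
    rows = {}
    for item_id in item_ids:
        key = row_key_for(customer_id, item_id, num_buckets)
        rows.setdefault(key, []).append(item_id)
    return rows


if __name__ == "__main__":
    layout = place_items("customer42", [f"item-{i}" for i in range(1000)], 4)
    for key in sorted(layout):
        print(key, len(layout[key]))
```

With a fixed num_buckets the trade-off from the question is visible: a
customer with a handful of items ends up with close to one row per item,
while a customer with millions of items still gets very wide rows.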
If you have wide rows, RowCache is a problem. IMHO RowCache is only
viable in situations where you have a fixed amount of data and thus
will get a high hit rate. I was running a large row cache for some
time and I found it unpredictable. It causes memory pressure on the
JVM from moving things in and out of memory, and if the hit rate is
low, taking a key and all its columns in and out repeatedly ends up
being counterproductive for disk utilization. I suggest KeyCache in
most situations (there is a ticket open for a fractional row cache).

Another factor to consider is that if you have many rows and many columns
you end up with larger indexes. In our case we have startup times
slightly longer than we would like because the process of sampling
indexes during startup is intensive. If I could do it all over again
I might serialize more into single columns rather than exploding data
across multiple rows and columns. If you always need to look up the
entire row, do not break it down by columns.

On memory mapping: there are different dynamics depending on data size
relative to memory size. You may have something like ~40GB of data
and a 10GB index with 32GB of RAM per node; this system is not going to
respond the same way as one with, say, 200GB of data and 25GB of indexes.
Also, it is very workload dependent.
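As a rough back-of-the-envelope illustration of the two cases above, the
snippet below computes what fraction of data plus index could sit in RAM
at once. It assumes the second node also has 32GB of RAM (the message does
not say), and it ignores the JVM heap and everything else competing for
memory, so treat the numbers as an upper bound at best. The function name
is made up for this sketch.

```python
def memory_resident_fraction(data_gb: float, index_gb: float, ram_gb: float) -> float:
    """Upper bound on the share of data + index that fits in RAM (simplified)."""
    total_gb = data_gb + index_gb
    return min(1.0, ram_gb / total_gb)


if __name__ == "__main__":
    # Figures from the message; 32GB RAM for the second case is an assumption.
    small = memory_resident_fraction(data_gb=40, index_gb=10, ram_gb=32)
    large = memory_resident_fraction(data_gb=200, index_gb=25, ram_gb=32)
    print(f"40GB data + 10GB index:  {small:.0%} could be resident")
    print(f"200GB data + 25GB index: {large:.0%} could be resident")
```

In the first case most reads can be served from mapped pages already in
memory; in the second most cannot, so benchmark results from one do not
carry over to the other.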

Hope this helps,
Edward
