cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Héctor Izquierdo Seliva <izquie...@strands.com>
Subject Re: Wide rows or tons of rows?
Date Mon, 11 Oct 2010 15:46:19 GMT
El lun, 11-10-2010 a las 11:08 -0400, Edward Capriolo escribió:

Inlined:

> 2010/10/11 Héctor Izquierdo Seliva <izquierdo@strands.com>:
> > Hi everyone.
> >
> > I'm sure this question or similar has come up before, but I can't find a
> > clear answer. I have to store a unknown number of items in cassandra,
> > which can vary from a few hundreds to a few millions per customer.
> >
> > I read that in cassandra wide rows are better than a lot of rows, but
> > then I face two problems. First, column distribution. The only way I can
> > think of distributing items among a given set of rows is hashing the
> > item id to a row id, and the using the item id as the column name. In
> > this way, I can distribute data among a few rows evenly, but If there
> > are only a few items it's equivalent to a row per item plus more
> > overhead, and if there are millions of items then the rows are to big,
> > and I have to turn off row cache. Does anybody knows a way around this?
> >
> > The second issue is that in my benchmarks, once the data is mmapped, one
> > item per row performs faster than wide rows by a significant margin. Is
> > this how it is supposed to be?
> >
> > I can give additional data if needed. English is not my first language
> > so I apologize beforehand is some of this doesn't make sense.
> >
> > Thanks for your time
> >
> >
> If you have wide rows RowCache is a problem. IMHO RowCache is only
> viable in situations where you have a fixed amount of data and thus
> will get a high hit rate. I was running a large row cache for some
> time and I found it unpredictable. It causes memory pressure on the
> JVM from moving things in and out of memory, and if the hit rate is
> low taking a key and all its columns in and out repeatedly ends up
> being counter productive for disk utilization. Suggest KeyCache in
> most situations, (there is a ticket opened for a fractional row cache)

I saw the same behavior. It's a pity there is not a column cache. That
would be awesome.

> Another factor to consider is if you have many rows and many columns
> you end up with large (er) indexes. In our case we have start up times
> slightly longer then we would like because the process of sampling
> indexes during start up is intensive. If I could do it all over again
> I might serialize more into single columns rather then exploding data
> across multiple rows and columns. If you always need to look up the
> entire row do not break it down by columns.

So it might be better to store a json serialized version then? I was
using SuperColumns to store item info, but a simple string might give me
the option to do some compression.

> memory mapping. There are different dynamics depending on data size
> relative to memory size. You may have something like ~ 40GB of data
> and 10GB index, 32GB RAM a node, this system is not going to respond
> the same way with say 200GB data 25 GB Indexes. Also it is very
> workload dependent.

We have a 6 node cluster with 16 GB RAM  each, although the whole
dataset is expected to be around 100GB per machine. Which indexes are
more expensive, row or column indexes?

> Hope this helps,
> Edward

It does!



Mime
View raw message