On Mon, 2010-10-11 at 11:08 -0400, Edward Capriolo wrote:

Inlined:

> 2010/10/11 Héctor Izquierdo Seliva:
> > Hi everyone.
> >
> > I'm sure this question or similar has come up before, but I can't find a
> > clear answer. I have to store an unknown number of items in Cassandra,
> > which can vary from a few hundred to a few million per customer.
> >
> > I read that in Cassandra wide rows are better than a lot of rows, but
> > then I face two problems. First, column distribution. The only way I can
> > think of distributing items among a given set of rows is hashing the
> > item id to a row id, and then using the item id as the column name. In
> > this way, I can distribute data among a few rows evenly, but if there
> > are only a few items it's equivalent to a row per item plus more
> > overhead, and if there are millions of items then the rows are too big,
> > and I have to turn off row cache. Does anybody know a way around this?
> >
> > The second issue is that in my benchmarks, once the data is mmapped, one
> > item per row performs faster than wide rows by a significant margin. Is
> > this how it is supposed to be?
> >
> > I can give additional data if needed. English is not my first language
> > so I apologize beforehand if some of this doesn't make sense.
> >
> > Thanks for your time
> >
>
> If you have wide rows RowCache is a problem. IMHO RowCache is only
> viable in situations where you have a fixed amount of data and thus
> will get a high hit rate. I was running a large row cache for some
> time and I found it unpredictable. It causes memory pressure on the
> JVM from moving things in and out of memory, and if the hit rate is
> low, taking a key and all its columns in and out repeatedly ends up
> being counterproductive for disk utilization. Suggest KeyCache in
> most situations (there is a ticket opened for a fractional row cache).

I saw the same behavior. It's a pity there is not a column cache. That
would be awesome.
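
For reference, the hashing scheme described in the quoted question could look roughly like the sketch below. It assumes a fixed number of buckets per customer; `NUM_BUCKETS`, `row_key_for`, `place`, and the `customer:bucket` key layout are hypothetical names for illustration, not part of any Cassandra API.

```python
import hashlib

NUM_BUCKETS = 16  # assumption: a fixed bucket count per customer


def row_key_for(customer_id: str, item_id: str, num_buckets: int = NUM_BUCKETS) -> str:
    """Map an item to one of num_buckets wide rows belonging to its customer."""
    # md5 is used only as a stable hash; Python's built-in hash() is salted
    # per process for strings, so it would not give repeatable row keys.
    digest = hashlib.md5(item_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_buckets
    return f"{customer_id}:{bucket}"


def place(customer_id, item_ids, num_buckets=NUM_BUCKETS):
    """Group item ids by the row key they hash to (item id = column name)."""
    rows = {}
    for item_id in item_ids:
        key = row_key_for(customer_id, item_id, num_buckets)
        rows.setdefault(key, []).append(item_id)
    return rows
```

With only a handful of items, most buckets end up holding a single column, which is the "row per item plus more overhead" case from the question.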

> Another factor to consider is if you have many rows and many columns
> you end up with large(r) indexes. In our case we have start up times
> slightly longer than we would like because the process of sampling
> indexes during start up is intensive. If I could do it all over again
> I might serialize more into single columns rather than exploding data
> across multiple rows and columns. If you always need to look up the
> entire row do not break it down by columns.

So it might be better to store a JSON-serialized version then? I was
using SuperColumns to store item info, but a simple string might give me
the option to do some compression.

> memory mapping. There are different dynamics depending on data size
> relative to memory size. You may have something like ~40GB of data
> and 10GB index, 32GB RAM a node; this system is not going to respond
> the same way with say 200GB data and 25GB indexes. Also it is very
> workload dependent.

We have a 6 node cluster with 16GB RAM each, although the whole dataset
is expected to be around 100GB per machine. Which indexes are more
expensive, row or column indexes?

> Hope this helps,
> Edward

It does!
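
A minimal sketch of the "JSON string plus compression" idea raised above, assuming each item is a plain dictionary and the resulting bytes are stored as a single column value. `pack_item` and `unpack_item` are hypothetical helper names; the choice of zlib is an assumption, and any codec would do.

```python
import json
import zlib


def pack_item(item: dict) -> bytes:
    """Serialize an item to compact JSON and compress it into one column value."""
    # Compact separators and sorted keys keep the encoding small and stable.
    text = json.dumps(item, separators=(",", ":"), sort_keys=True)
    return zlib.compress(text.encode("utf-8"))


def unpack_item(blob: bytes) -> dict:
    """Reverse pack_item: decompress, then parse the JSON text."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Very small items may not shrink at all, since zlib adds a few bytes of framing, so it is worth measuring on real data before committing to it.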