cassandra-user mailing list archives

From "Vegard Berget" <>
Subject Number of rows under one partition key
Date Thu, 15 May 2014 13:10:13 GMT
I know this has been discussed before, and I know there are practical limits to how many rows one partition key can handle. But I am not sure whether the number of rows or the total data size is the deciding factor. I know the Thrift interface well, but this is my first project where we are actively using CQL, so this is also new to me.
The case is like this: We have a partition key, clientid, with a clustering key (id). The number of rows with the same clientid is normally between 10,000 and 100,000, I would guess. The data is pretty small, let's say 200 bytes per row on average (probably even smaller, but for the example let's assume 200 bytes). In some edge cases there _CAN_ however be more rows with the same clientid, I would guess up to 1,000,000. Most of the time we read with both clientid and id, or we read, for example, 1,000 rows with just clientid. It would be nice to be able to fetch all rows in one query, if possible.

The ratio of reads to writes is about 100 to 1. Of the writes, I guess updates to inserts is about 1:4. Deletes are rare.

Currently the production environment is Cassandra 1.2.11, but we are testing this on Cassandra 2.0.something in our development environment.
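For reference, the schema described above might look roughly like this in CQL. The key names (clientid, id) are from the description; the table name, column types, and the payload column are assumptions for illustration:

```sql
-- Hypothetical table matching the described layout (names/types assumed)
CREATE TABLE client_data (
    clientid bigint,
    id       timeuuid,
    payload  text,        -- ~200 bytes per row on average
    PRIMARY KEY (clientid, id)   -- clientid = partition key, id = clustering key
);

-- Point read by full primary key:
SELECT * FROM client_data WHERE clientid = ? AND id = ?;

-- Slice of rows for one client:
SELECT * FROM client_data WHERE clientid = ? LIMIT 1000;
```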
Questions:

Should we add another component to the partition key to avoid 1,000,000 rows in the same Thrift row (which is how I understand it is actually stored)? Or is 1,000,000 rows okay? If we add a "bucketid"-ish component to the partition key, how should we do queries most efficiently?

Since reading is the most important, and writing and space are not an issue, should we have a high replication factor and read from (relatively) few nodes? When it comes to consistency, it isn't a problem to wait for everything to be replicated to responsive nodes (within some milliseconds or even seconds), but if a node goes down and comes back with very old data (multiple minutes, hours, or days), that would be a problem, at least if it happened regularly. What, in practice, is the cost of reading with a high number of nodes in the consistency level? Does replicating to 4 nodes and reading from 2 sound like an okay option here (avoiding full consistency, but at the same time, if one node crashes and comes up with old data, we would still get a pretty consistent result; the probability of 2 of the nodes crashing at the same time is low, and _maybe_ something we can live with in this specific case)?
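The bucketing idea above can be sketched in CQL by moving a bucket column into a composite partition key. The table name, bucket count (4 here), and bucket-assignment rule are assumptions, not a recommendation:

```sql
-- Hypothetical bucketed variant: (clientid, bucket) together form the
-- partition key, so one client's rows are spread over several partitions.
CREATE TABLE client_data_bucketed (
    clientid bigint,
    bucket   int,         -- chosen deterministically at write time, e.g. hash(id) % 4
    id       timeuuid,
    payload  text,
    PRIMARY KEY ((clientid, bucket), id)
);

-- With 4 buckets, the 1,000,000-row edge case becomes ~250,000 rows per partition.
-- Reading "all rows for a client" then means covering every bucket, either with
-- one query per bucket issued in parallel, or with IN on the bucket component:
SELECT * FROM client_data_bucketed
 WHERE clientid = ? AND bucket IN (0, 1, 2, 3);
```

The trade-off is that single-row reads must also compute the bucket from the id before querying, so the bucket function has to be deterministic.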
Are there other considerations, for example compaction strategy? And should we do an upgrade to 2.0 because of this? (We will upgrade anyway, but if it is recommended, we will continue to use 2.0 in development and upgrade the production environment sooner.)
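On the compaction-strategy point: for read-heavy workloads like the one described, LeveledCompactionStrategy is often suggested over the default size-tiered strategy, since it bounds the number of SSTables a read has to touch at the cost of more compaction I/O. A sketch, assuming the hypothetical table name from above:

```sql
-- Hypothetical: switching an existing table to leveled compaction,
-- which tends to favour read-heavy workloads at the cost of extra write I/O
ALTER TABLE client_data
  WITH compaction = { 'class': 'LeveledCompactionStrategy' };
```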
I have done some testing: inserting a million rows and selecting them all, counting them, and selecting individual rows (with both clientid and id). It seems fine, but I want to ask to be sure that I am on the right track.
Best regards,
Vegard Berget
