cassandra-user mailing list archives

From Tyler Hobbs
Subject Re: Pros and cons of lots of very small partitions versus fewer larger partitions
Date Fri, 05 Dec 2014 17:33:18 GMT
On Fri, Dec 5, 2014 at 11:14 AM, Robert Wille wrote:

>  And let's say that bucket is computed as id / N. For analysis purposes,
> let's assume I have 100 million ids to store.
>  Table a is obviously going to have a larger bloom filter. That’s a clear
> negative.

That's true, *but*, if you frequently ask for rows that do not exist, Table
A's BF will reject most of those requests outright, while Table B will
almost always get a "hit" from the BF (the bucket partition does exist) and
have to look into the SSTable to see that the requested row doesn't actually
exist.
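
To put rough numbers on the bloom filter size difference, here is a
back-of-the-envelope sketch. It uses the textbook optimal Bloom filter
formula, not Cassandra's actual sizing code, and both N = 1000 and the 0.01
false-positive target are assumptions picked purely for illustration.

```python
import math

def bloom_filter_bits(num_keys: int, fp_chance: float) -> int:
    """Textbook optimal Bloom filter size, in bits, for num_keys entries."""
    return math.ceil(-num_keys * math.log(fp_chance) / (math.log(2) ** 2))

TOTAL_IDS = 100_000_000  # from the question
N = 1000                 # hypothetical bucket size for table B
FP_CHANCE = 0.01         # assumed false-positive target

partitions_a = TOTAL_IDS                 # table A: one partition per id
partitions_b = math.ceil(TOTAL_IDS / N)  # table B: one partition per bucket

for name, partitions in (("A", partitions_a), ("B", partitions_b)):
    size_mb = bloom_filter_bits(partitions, FP_CHANCE) / 8 / 1e6
    print(f"table {name}: {partitions:,} partitions, ~{size_mb:.2f} MB of bloom filter")
```

Under those assumptions the filter needs a bit under 10 bits per partition
key, so table A lands somewhere around 120 MB in total across its SSTables
and table B around 0.12 MB.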

>  When I request a record, table a will have less data to load from disk,
> so that seems like a positive.


>  Table a will never have its columns scattered across multiple SSTables,
> but table b might. If I only want one row from a partition in table b, does
> fragmentation matter (I think probably not, but I’m not sure)?

Yes, fragmentation can matter.  Cassandra knows the min and max clustering
column values for each SSTable, so it can use those to narrow down the set
of SSTables it needs to read if you request a specific clustering column
value.  However, in your example, this isn't likely to narrow things down
much, so it will have to check many more SSTables.
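
A toy model of that min/max check may make this more concrete. This is only
a sketch: the SSTableMeta class, the table names, and the id ranges are all
made up for illustration, and the real read path does considerably more
than this.

```python
from dataclasses import dataclass

@dataclass
class SSTableMeta:
    name: str
    min_id: int  # smallest clustering value stored in this SSTable
    max_id: int  # largest clustering value stored in this SSTable

def candidates(sstables, wanted_id):
    """Names of SSTables whose [min_id, max_id] range could contain wanted_id."""
    return [s.name for s in sstables if s.min_id <= wanted_id <= s.max_id]

# Clustering values written in increasing order (e.g. a time series):
# the ranges barely overlap, so the min/max check is very selective.
ordered = [
    SSTableMeta("o-1", 0, 999),
    SSTableMeta("o-2", 1000, 1999),
    SSTableMeta("o-3", 2000, 2999),
]

# Ids written in no particular order: every SSTable spans nearly the whole
# range, so the min/max check rules out almost nothing.
scattered = [
    SSTableMeta("s-1", 3, 2990),
    SSTableMeta("s-2", 0, 2999),
    SSTableMeta("s-3", 12, 2975),
]

print(candidates(ordered, 1500))    # ['o-2']
print(candidates(scattered, 1500))  # ['s-1', 's-2', 's-3']
```

The bucketed-id case in your example looks like the second list, which is
why the read still has to consult most of the SSTables holding that
partition.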

>  It’s not clear to me which will fit more efficiently on disk, but I
> would guess that table a wins.

They're probably close enough not to matter very much.

>  Smaller partitions means sending less data during repair, but I suspect
> that when computing the Merkle tree for the table, more partitions might
> mean more overhead, but that’s only a guess. Which one repairs more
> efficiently?

Table A repairs more efficiently by far.  Currently, repair must stream
entire partitions when they differ.  It cannot repair individual rows
within a partition.
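
As a rough illustration of what partition-level repair costs, the sketch
below counts the rows that would be re-sent when a handful of ids are out of
sync. It assumes every bucket is full, uses a hypothetical N = 1000, and
ignores Merkle tree granularity entirely, so treat it as arithmetic rather
than a model of the real repair process.

```python
def rows_streamed(differing_ids, bucket_size):
    """Rows re-sent if repair works on whole partitions.

    bucket_size=1 models table A (one row per partition); a larger
    bucket_size models table B with bucket = id // bucket_size.
    Assumes every bucket holds exactly bucket_size rows.
    """
    damaged_partitions = {i // bucket_size for i in differing_ids}
    return len(damaged_partitions) * bucket_size

differing = [5, 1005, 1007, 250_000]  # made-up ids that differ between replicas

print(rows_streamed(differing, 1))     # table A: 4
print(rows_streamed(differing, 1000))  # table B, N = 1000: 3000
```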

>  In your opinion, which one is best and why? If you think table b is
> best, what would you choose N to be?

Table A, hands down.  Here's why: you should model your tables to fit your
queries.  If you're doing a basic K/V lookup, model it like table A.
People recommend wide partitions because many (if not most) queries are
best served by that type of model, so if you're not using wide partitions,
it's a sign that something might be wrong.  However, there are certainly
plenty of use cases where single-row partitions are fine.

Tyler Hobbs
DataStax
