cassandra-user mailing list archives

From Robert Wille <rwi...@fold3.com>
Subject Pros and cons of lots of very small partitions versus fewer larger partitions
Date Fri, 05 Dec 2014 17:14:26 GMT
At the data modeling class at the Cassandra Summit, the instructor said that lots of small
partitions are just fine. I’ve heard on this list that this is not true, and that it’s better
to cluster small partitions into fewer, larger partitions. Due to conflicting information
on this issue, I’d be interested in hearing people’s opinions.

For the sake of discussion, let’s compare two tables:

CREATE TABLE a (
    id INT,
    value INT,
    PRIMARY KEY (id)
);

CREATE TABLE b (
    bucket INT,
    id INT,
    value INT,
    PRIMARY KEY ((bucket), id)
);

And let’s say that bucket is computed as id / N. For analysis purposes, let’s assume I have
100 million ids to store.
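
For concreteness, here’s what a single-id read looks like against each table (N = 1000 is just a placeholder, and the application computes the bucket itself before querying):

-- Table a: the id is the whole partition key.
SELECT value FROM a WHERE id = 12345678;

-- Table b: the application first computes bucket = id / N.
-- With N = 1000, id 12345678 falls in bucket 12345.
SELECT value FROM b WHERE bucket = 12345 AND id = 12345678;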

Table a is obviously going to have larger bloom filters, since it has N times as many partition keys. That’s a clear negative.
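
To put a rough number on that (assuming the default 0.01 false-positive chance for size-tiered compaction, and ignoring overlap between SSTables): a bloom filter needs about -ln(p) / (ln 2)^2 ≈ 9.6 bits per partition key at p = 0.01, so table a’s 100 million partitions work out to something like 120 MB of bloom filter, while table b would only need roughly 1/N of that.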

When I request a record, table a will have less data to load from disk, so that seems like
a positive.

Table a will never have its columns scattered across multiple SSTables, but table b might.
If I only want one row from a partition in table b, does fragmentation matter (I think probably
not, but I’m not sure)?
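
(I suppose I could check that empirically by turning tracing on in cqlsh and seeing how many SSTables a single-row read from table b actually touches, e.g.

TRACING ON;
SELECT value FROM b WHERE bucket = 12345 AND id = 12345678;

but I’d still like to understand what to expect.)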

It’s not clear to me which will fit more efficiently on disk, but I would guess that table
a wins.
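
(I could presumably settle that by loading the same sample data into both tables, flushing, and comparing the space used that nodetool cfstats reports for each table, e.g.

nodetool flush
nodetool cfstats

but I’d rather understand the reason than just measure it.)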

Smaller partitions mean sending less data during repair, but I suspect that computing the Merkle
tree for the table involves more overhead when there are more partitions; that’s only
a guess, though. Which one repairs more efficiently?

In your opinion, which one is better, and why? If you think table b is better, what would you choose
N to be?

Robert

