cassandra-user mailing list archives

From Jim Ancona <>
Subject Re: Data Modeling: Partition Size and Query Efficiency
Date Tue, 05 Jan 2016 16:07:55 GMT
Thanks for responding!

My natural partition key is a customer id. Our customers have widely
varying amounts of data. Since the vast majority of them have data that's
small enough to fit in a single partition, I'd like to avoid imposing
unnecessary overhead on the 99% just to avoid issues with the largest 1%.

The approach to querying across multiple partitions you describe is pretty
much what I have in mind. The trick is to avoid having to query 50
partitions to return a few hundred or thousand rows.

I agree that sequentially filling partitions is something to avoid. That's
why I'm hoping someone can suggest a good alternative.


On Mon, Jan 4, 2016 at 8:07 PM, Clint Martin <> wrote:

> You should endeavor to use a repeatable method of segmenting your data.
> Swapping partitions every time you "fill one" seems like an anti-pattern to
> me, but I suppose it really depends on what your primary key is. Can you
> share some more information on this?
> In the past I have utilized the consistent-hash method you described (add
> an artificial row-key segment by taking part of the clustering key modulo a
> fixed bucket count) combined with a lazy evaluation cursor.
> The lazy evaluation cursor is essentially set up to query X number of
> partitions simultaneously, but to execute those queries only as needed to
> fill the page size. To perform paging you have to know the last primary key
> that was returned so you can use that to limit the next iteration.
> You can trade latency for additional work load by controlling the number
> of concurrent executions you do as the iterating occurs. Or you can
> minimize the work on your cluster by querying each partition one at a time.
> Unfortunately, due to the artificial partition-key segment you cannot
> iterate or page in any particular order (at least across partitions),
> unless your hash function can also provide some ordering guarantees.
> It all just depends on your requirements.
> Clint
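The lazy-evaluation cursor Clint describes can be sketched roughly as follows. This is a minimal illustration, not his implementation: `lazy_pages` and the in-memory `buckets` lists are hypothetical stand-ins for real per-partition CQL reads, and a real cursor would also carry the last primary key returned per partition as a query bound.

```python
def lazy_pages(buckets, page_size):
    """Yield pages of rows, pulling from each bucket (partition) only
    as needed to fill each page. Order across buckets is arbitrary,
    matching the 'no ordering across partitions' caveat above."""
    page = []
    for bucket in buckets:        # each bucket = one partition's rows
        for row in bucket:        # stands in for a paged per-partition query
            page.append(row)
            if len(page) == page_size:
                yield page
                page = []
    if page:                      # final short page, if any
        yield page

# Example: 3 buckets, page size 4. The first page is satisfied entirely
# from the first bucket, so the other partitions are never touched for it.
buckets = [[f"b0-r{i}" for i in range(5)], ["b1-r0", "b1-r1"], ["b2-r0"]]
pages = list(lazy_pages(buckets, 4))
```

Querying buckets one at a time (as here) minimizes cluster load; issuing several bucket queries concurrently trades extra work for lower latency, as the post notes.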
> On Jan 4, 2016 10:13 AM, "Jim Ancona" <> wrote:
>> A problem that I have run into repeatedly when doing schema design is how
>> to control partition size while still allowing for efficient multi-row
>> queries.
>> We want to limit partition size to some number between 10 and 100
>> megabytes to avoid operational issues. The standard way to do that is to
>> figure out the maximum number of rows that your "natural partition key"
>> will ever need to support and then add an additional artificial partition
>> key that segments the rows sufficiently to keep the partition size
>> under the maximum. In the case of time series data, this is often done by
>> bucketing by time period, i.e. creating a new partition every minute, hour
>> or day. For non-time series data by doing something like
>> Hash(clustering-key) mod desired-number-of-partitions.
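The Hash(clustering-key) mod N scheme above can be sketched as follows. This is a minimal illustration under assumed details: the bucket count of 50 and the (customer_id, bucket) key layout are examples, not from the original post. A stable hash (here MD5, rather than Python's per-process-randomized `hash()`) keeps bucket placement repeatable across writers and readers.

```python
import hashlib

NUM_BUCKETS = 50  # fixed up front, sized to cap the largest partition

def bucket_for(clustering_key: str) -> int:
    """Artificial partition-key segment: stable hash of the clustering
    key, modulo a fixed bucket count."""
    digest = hashlib.md5(clustering_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

# The full partition key then combines the natural key with the bucket,
# e.g. PRIMARY KEY ((customer_id, bucket), clustering_key) in CQL terms:
row_key = ("customer-123", bucket_for("2016-01-05/event-42"))
```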
>> In my case, multi-row queries to support a REST API typically return a
>> page of results, where the page size might be anywhere from a few dozen up
>> to thousands. For query efficiency I want the average number of rows per
>> partition to be large enough that a query can be satisfied by reading a
>> small number of partitions--ideally one.
>> So I want to simultaneously limit the maximum number of rows per
>> partition and yet maintain a large enough average number of rows per
>> partition to make my queries efficient. But with my data the ratio between
>> maximum and average can be very large (up to four orders of magnitude).
>> Here is an example:
>>
>>                      Rows per Partition   Partition Size
>>   Mode                               1             1 KB
>>   Median                           500           500 KB
>>   90th percentile                5,000             5 MB
>>   99th percentile               50,000            50 MB
>>   Maximum                    2,500,000           2.5 GB
>>
>> In this case, 99% of my data could fit in a single 50 MB partition. But
>> if I use the standard approach, I have to split my partitions into 50
>> pieces to accommodate the largest data. That means that to query the 700
>> rows for my median case, I have to read 50 partitions instead of one.
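To see why the median case really does touch every piece: hashing spreads rows uniformly, so with 50 buckets the expected number of buckets left empty by 500 rows is about 50 * (49/50)^500 ≈ 0.002. A quick check (illustrative only; the hash and bucket count mirror the earlier example):

```python
import hashlib

NUM_BUCKETS = 50

def bucket_for(key: str) -> int:
    # Stable hash of the clustering key, modulo the fixed bucket count.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

# 500 rows (the median case) hashed into 50 buckets: essentially every
# bucket ends up holding rows, so a paged read for this customer must
# fan out across all 50 partitions.
touched = {bucket_for(f"row-{i}") for i in range(500)}
```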
>> If you try to deal with this by starting a new partition when an old one
>> fills up, you have a nasty distributed consensus problem, along with
>> read-before-write. Cassandra LWT wasn't available the last time I dealt
>> with this, but might help with the consensus part today. But there are
>> still some nasty corner cases.
>> I have some thoughts on other ways to solve this, but they all have
>> drawbacks. So I thought I'd ask here and hope that someone has a better
>> approach.
>> Thanks in advance,
>> Jim
