You should endeavor to use a repeatable method of segmenting your data.
Swapping partitions every time you "fill one" seems like an anti pattern to
me. but I suppose it really depends on what your primary key is. Can you
share some more information on this?
In the past I have utilized the consistent hash method you described (add
an artificial row key segment by modulo some part of the clustering key by
a fixed position count) combined with a lazy evaluation cursor.
The lazy evaluation cursor essentially is set up to query X number of
partitions simultaneously, but to execute those queries only add needed to
fill the page size. To perform paging you have to know the last primary key
that was returned so you can use that to limit the next iteration.
You can trade latency for additional work load by controlling the number of
concurrent executions you do as the iterating occurs. Or you can minimize
the work on your cluster by querying each partition one at a time.
Unfortunately due to the artificial partition key segment you cannot
iterate or page in any particular order...(at least across partitions)
Unless your hash function can also provide you some ordering guarantees.
It all just depends on your requirements.
Clint
On Jan 4, 2016 10:13 AM, "Jim Ancona" <jim@anconafamily.com> wrote:
> A problem that I have run into repeatedly when doing schema design is how
> to control partition size while still allowing for efficient multirow
> queries.
>
> We want to limit partition size to some number between 10 and 100
> megabytes to avoid operational issues. The standard way to do that is to
> figure out the maximum number of rows that your "natural partition key"
> will ever need to support and then add an additional artificial partition
> key that segments the rows sufficiently to get keep the partition size
> under the maximum. In the case of time series data, this is often done by
> bucketing by time period, i.e. creating a new partition every minute, hour
> or day. For nontime series data by doing something like
> Hash(clusteringkey) mod desirednumberofpartitions.
>
> In my case, multirow queries to support a REST API typically return a
> page of results, where the page size might be anywhere from a few dozen up
> to thousands. For query efficiency I want the average number of rows per
> partition to be large enough that a query can be satisfied by reading a
> small number of partitionsideally one.
>
> So I want to simultaneously limit the maximum number of rows per partition
> and yet maintain a large enough average number of rows per partition to
> make my queries efficient. But with my data the ratio between maximum and
> average can be very large (up to four orders of magnitude).
>
> Here is an example:
>
>
> Rows per Partition
>
> Partition Size
>
> Mode
>
> 1
>
> 1 KB
>
> Median
>
> 500
>
> 500 KB
>
> 90th percentile
>
> 5,000
>
> 5 MB
>
> 99th percentile
>
> 50,000
>
> 50 MB
>
> Maximum
>
> 2,500,000
>
> 2.5 GB
>
> In this case, 99% of my data could fit in a single 50 MB partition. But if
> I use the standard approach, I have to split my partitions into 50 pieces
> to accommodate the largest data. That means that to query the 700 rows for
> my median case, I have to read 50 partitions instead of one.
>
> If you try to deal with this by starting a new partition when an old one
> fills up, you have a nasty distributed consensus problem, along with
> readbeforewrite. Cassandra LWT wasn't available the last time I dealt
> with this, but might help with the consensus part today. But there are
> still some nasty corner cases.
>
> I have some thoughts on other ways to solve this, but they all have
> drawbacks. So I thought I'd ask here and hope that someone has a better
> approach.
>
> Thanks in advance,
>
> Jim
>
>
