incubator-cassandra-user mailing list archives

From Donald Smith
Subject RE: Question about how compaction and partition keys interact
Date Wed, 26 Mar 2014 18:54:04 GMT
My underlying question is about the effects of the partitioning key on compaction. Specifically,
would having date as part of the partitioning key make compaction easier (because compaction
wouldn't have to merge wide rows over multiple days)? According to the person on IRC, it
wouldn't make much difference.

We care mostly about read times. If read times were all we cared about, we'd use a CQL primary
key of ((customer_id,type),date), especially since it lets us efficiently iterate over all
dates for a given customer and type. I also care about compaction time, and if the other
primary key form decreased compaction time, I might go for it. We have terabytes of data.

I don't think we ever have to query all types for a given customer or date.  That is, we are
always given a specific customer and type, plus usually but not always a date.

Thanks, Don

From: Jonathan Lacefield
Sent: Wednesday, March 26, 2014 11:20 AM
Subject: Re: Question about how compaction and partition keys interact


  What is the underlying question?  Are you trying to figure out what's going to be faster for
reads, or are you really concerned about storage?

  The typical recommendation is to model tables based on query access, to enable the fastest
read performance.

  In your example, will your app's queries look for
  1)  customer interactions by type by day, with the ability to
           - sort by day within a type
           - grab ranges of dates for a type quickly
           - or pull all dates (and cell data) for a type
  2)  customer interactions by date by type, with the ability to
           - sort by type within a date
           - grab ranges of types for a date quickly
           - or pull all types' data for a date

  We also typically recommend that partitions stay within ~100k columns or ~100MB per partition.
 With your first scenario, the wide row, you wouldn't hit that column count for ~273 years.
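The ~273-year figure can be reproduced with a short calculation. This is a minimal sketch in Python, assuming one new column per day in each (customer_id, type) partition, as described for the daily reports; the constant and function names are made up for illustration.

```python
# Assumption: each (customer_id, type) partition gains one column per day.
COLUMN_GUIDELINE = 100_000   # the ~100k columns-per-partition guideline
COLUMNS_PER_DAY = 1          # one daily report per partition (assumed)
DAYS_PER_YEAR = 365

def years_until_guideline(limit=COLUMN_GUIDELINE, per_day=COLUMNS_PER_DAY):
    """Years of daily writes before a partition reaches the column guideline."""
    return limit / (per_day * DAYS_PER_YEAR)

print(int(years_until_guideline()))  # whole years before the guideline is reached
```

If a partition stored several columns per day, the horizon would shrink proportionally, so the per-day count is the number worth checking against real data.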

  What's interesting in your modeling scenario is that, with the current options, you don't
have the ability to easily pull all dates for a customer without specifying the type, specific
dates, or using ALLOW FILTERING.  Did you ever consider partitioning simply on customer and
using date and type as clustering keys?
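The access patterns for the candidate keys can be compared with a toy model. This is a simplified sketch in Python, not Cassandra's query planner: it assumes a query avoids ALLOW FILTERING only when every partition key column is restricted by equality and any clustering restrictions follow the declared clustering order. The name `servable`, the `FORM*` constants, and the clustering order (type, date) chosen for the customer-only partition key are assumptions for illustration.

```python
def servable(partition_key, clustering, eq, rng=None):
    """Return True if the query needs no ALLOW FILTERING under the simplified rules.

    eq  -- set of columns restricted by equality
    rng -- optional column restricted by a range (e.g. a date range)
    """
    eq = set(eq)
    # Every partition key column must be given by equality.
    if not set(partition_key) <= eq:
        return False
    remaining = eq - set(partition_key)
    # Remaining equality restrictions must cover a prefix of the clustering columns.
    i = 0
    while i < len(clustering) and clustering[i] in remaining:
        remaining.discard(clustering[i])
        i += 1
    if remaining:
        return False
    # A range is allowed only on the next clustering column after that prefix.
    if rng is not None:
        return i < len(clustering) and clustering[i] == rng
    return True

# (partition key columns, clustering columns)
FORM1 = (("customer_id", "type"), ("date",))   # ((customer_id,type),date)
FORM2 = (("customer_id", "date"), ("type",))   # ((customer_id,date),type)
FORM3 = (("customer_id",), ("type", "date"))   # customer only; clustering order assumed

for name, form in (("FORM1", FORM1), ("FORM2", FORM2), ("FORM3", FORM3)):
    print(name,
          servable(*form, eq={"customer_id", "type"}),              # all dates for a type
          servable(*form, eq={"customer_id", "type"}, rng="date"),  # date range for a type
          servable(*form, eq={"customer_id"}))                      # everything for a customer
```

Under these assumed rules, only the customer-only partition key serves a query that names just the customer, at the cost of a much larger partition per customer.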

  Hope that helps.


Jonathan Lacefield
Solutions Architect, DataStax
(404) 822 3487

On Wed, Mar 26, 2014 at 1:22 PM, Donald Smith wrote:
In CQL we need to decide between using ((customer_id,type),date) as the CQL primary key for
a reporting table, versus ((customer_id,date),type).

We store reports for every day.  If we use (customer_id,type) as the partition key (physical
key), then we have a WIDE ROW where each date's data is stored in a different column. Over
time, as new reports are added for different dates, the row will get wider and wider, and
I thought that might cause more work for compaction.

So, would a partition key of (customer_id,date) yield better compaction behavior?

Again, if we use (customer_id,type) as the partition key, then over time, as new columns are
added to that row for different dates, I'd think that compaction would have to merge new data
for a given physical row from multiple sstables. That would make compaction expensive.  But
if we use (customer_id,date) as the partition key, then new data will be added to new physical
rows, and so compaction would have less work to do?

My question is really about how compaction interacts with partition keys.  Someone on the
Cassandra IRC channel said that when partition
keys overlap between sstables, there's only "slightly" more work to do than when they don't,
for merging sstables in compaction.  So he thought the first form, ((customer_id,type),date),
would be better.
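The claim about overlapping partition keys can be illustrated with a toy merge. This is a simplified sketch in Python, not Cassandra's compaction code: it assumes each sstable is a list of (partition_key, clustering_key, timestamp, value) cells already sorted by key, orders partitions by raw key rather than by token, and ignores tombstones. The function name `compact` and the sample data are made up for illustration.

```python
import heapq
from itertools import groupby

def cell_key(cell):
    """Sort key of a cell: (partition_key, clustering_key)."""
    return (cell[0], cell[1])

def compact(sstables):
    """Merge sorted sstables into one sorted run, keeping the newest cell per key."""
    merged = heapq.merge(*sstables, key=cell_key)   # streaming k-way merge
    out = []
    for _, versions in groupby(merged, key=cell_key):
        out.append(max(versions, key=lambda c: c[2]))  # highest timestamp wins
    return out

# Form 1: partition (customer_id, type), clustering date.
# Both sstables touch the same partition.
wide_1 = [(("c1", "A"), "2014-03-24", 1, "r1"),
          (("c1", "A"), "2014-03-25", 2, "r2")]
wide_2 = [(("c1", "A"), "2014-03-25", 3, "r2b"),
          (("c1", "A"), "2014-03-26", 4, "r3")]

# Form 2: partition (customer_id, date), clustering type.
# Each new day lands in a new partition.
narrow_1 = [(("c1", "2014-03-24"), "A", 1, "r1"),
            (("c1", "2014-03-25"), "A", 2, "r2")]
narrow_2 = [(("c1", "2014-03-26"), "A", 4, "r3")]

print(len(compact([wide_1, wide_2])), len(compact([narrow_1, narrow_2])))
```

In this model both layouts go through the same single pass over sorted input; overlap only adds the step of choosing between versions of the same cell, which is consistent with the "slightly more work" description.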

One advantage of the first form, ((customer_id,type),date), is that we can get all report
data for all dates for a given customer and type in a single wide row -- and we do have an
(uncommon) use case for such reports.

If we used a primary key of ((customer_id,type,date)), then the rows would be un-wide; that
wouldn't take advantage of clustering columns and (like the second form) wouldn't support
the (uncommon) use case mentioned in the previous paragraph.

Thanks, Don

Donald A. Smith | Senior Software Engineer
P: 425.201.3900 x 3866
C: (206) 819-5965
F: (646) 443-2333

