cassandra-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Question about how compaction and partition keys interact
Date Thu, 27 Mar 2014 16:24:54 GMT
If the numbers of types and dates per customer are reasonably modest (dozens? hundreds?), it
may not matter much at all. What are the numbers here: average/maximum types per customer
and dates per customer? In fact, depending on the numbers, maybe the partition key should
only be the customer. I mean, Cassandra is good for reasonably wide partitions, and exploding
the number of partitions might not be playing to Cassandra’s strengths in any meaningful
way. It might be different if you were talking about many thousands of types or dates per
customer, but... are you?
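
For a rough feel of how the numbers play out, here is a back-of-the-envelope sketch in Python. The customer, type, and date counts are made up for illustration (they are not from this thread), it assumes every customer has a report for every (type, date) pair, and the clustering order shown for the customer-only key is just one possible choice.

```python
def partition_stats(customers, types_per_customer, dates_per_customer):
    """Return (partition count, rows per partition) for each candidate key.

    Assumes every customer has one report per (type, date) pair.
    """
    return {
        "((customer_id,type),date)": (
            customers * types_per_customer,
            dates_per_customer,
        ),
        "((customer_id,date),type)": (
            customers * dates_per_customer,
            types_per_customer,
        ),
        # Customer-only partition key; clustering order is an assumption.
        "((customer_id),type,date)": (
            customers,
            types_per_customer * dates_per_customer,
        ),
    }


# Hypothetical workload: 1000 customers, 20 report types, one year of dates.
stats = partition_stats(customers=1000, types_per_customer=20, dates_per_customer=365)
for key, (partitions, rows) in stats.items():
    print(f"{key:28} partitions={partitions:>8,}  rows/partition={rows:>6,}")
```

With counts in this range, even the customer-only key gives partitions of a few thousand rows, which is the sense in which the choice may not matter much.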

But ultimately it does come down to how you will be accessing the data – query, view, update.

-- Jack Krupansky

From: Donald Smith 
Sent: Wednesday, March 26, 2014 1:22 PM
To: user@cassandra.apache.org
Subject: Question about how compaction and partition keys interact

In CQL we need to decide between using ((customer_id,type),date) as the CQL primary key for
a reporting table, versus ((customer_id,date),type).

 

We store reports for every day. If we use (customer_id,type) as the partition key (physical
key), then we have a WIDE ROW where each date's data is stored in a different column. Over
time, as new reports are added for different dates, the row will get wider and wider, and
I thought that might cause more work for compaction.

 

So, would a partition key of (customer_id,date) yield better compaction behavior?  

 

Again, if we use (customer_id,type) as the partition key, then over time, as new columns are
added to that row for different dates, I’d think that compaction would have to merge new
data for a given physical row from multiple sstables. That would make compaction expensive.
But if we use (customer_id,date) as the partition key, then new data will be added to new
physical rows, and so compaction would have less work to do?
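
The overlap being described can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions: each date's reports are flushed into their own sstable, no compaction has run yet, and the customer, type, and date values are invented. It counts how many sstables each partition appears in; it does not measure what the merge actually costs.

```python
from collections import defaultdict
from itertools import product


def sstables_per_partition(customers, types, dates, partition_key):
    """Count how many sstables each partition appears in.

    Toy assumption: each date's reports land in their own sstable.
    """
    seen = defaultdict(set)
    for sstable_id, date in enumerate(dates):
        for customer, report_type in product(customers, types):
            row = {"customer_id": customer, "type": report_type, "date": date}
            key = tuple(row[column] for column in partition_key)
            seen[key].add(sstable_id)
    return {key: len(ids) for key, ids in seen.items()}


# Invented sample data.
customers = ["c1", "c2"]
types = ["clicks", "views"]
dates = ["2014-03-24", "2014-03-25", "2014-03-26"]

by_type = sstables_per_partition(customers, types, dates, ("customer_id", "type"))
by_date = sstables_per_partition(customers, types, dates, ("customer_id", "date"))

print("(customer_id,type):", len(by_type), "partitions, each in",
      max(by_type.values()), "sstables")
print("(customer_id,date):", len(by_date), "partitions, each in",
      max(by_date.values()), "sstable")
```

In this model every (customer_id,type) partition is spread across all of the daily sstables, while every (customer_id,date) partition lives in exactly one. Whether that difference matters for compaction in practice is the open question.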

 

My question is really about how compaction interacts with partition keys.  Someone on the
Cassandra irc channel, http://webchat.freenode.net/?channels=#cassandra, said that when partition
keys overlap between sstables, there’s only “slightly” more work to do than when they
don’t, for merging sstables in compaction.  So he thought the first form, ((customer_id,type),date),
would be better.

 

One advantage of the first form, ((customer_id,type),date), is that we can get all report
data for all dates for a given customer and type in a single wide row, and we do have a
(uncommon) use case for such reports.

 

If we used a primary key of ((customer_id,type,date)), then the rows would be un-wide; that
wouldn’t take advantage of clustering columns and (like the second form) wouldn’t support
the (uncommon) use case mentioned in the previous paragraph.

 

Thanks, Don

 

Donald A. Smith | Senior Software Engineer 
P: 425.201.3900 x 3866
C: (206) 819-5965
F: (646) 443-2333
donalds@AudienceScience.com




 
