Date: Sat, 30 May 2015 02:28:17 +0000 (UTC)
From: "Jeff Jirsa (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-8340) Use sstable min timestamp when deciding if an sstable should be included in DTCS compactions

     [ https://issues.apache.org/jira/browse/CASSANDRA-8340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Jirsa updated CASSANDRA-8340:
----------------------------------
    Attachment: 8340.diff

We have a use case where we're unable to rely on the max timestamp because it CAN BE (in some partitions, but not all) modified to be pathologically high (but never pathologically low), causing the existing filtering code to work as designed, but not in a way that benefits us.
While we could write a custom DTCS-like replacement, this proposed improvement has the nice side effect of also solving our problem, so I've written a simple patch that enables it OPTIONALLY for users who prefer to filter on {{sstable.getMinTimestamp()}} rather than {{sstable.getMaxTimestamp()}}.

The proposed syntax is fairly straightforward. Standard behavior is unchanged:

{noformat}
create table test (id text primary key) with compaction={
    'max_sstable_age_days': '1',
    'base_time_seconds': '3600',
    'min_threshold': '4',
    'class': 'DateTieredCompactionStrategy'
};
{noformat}

Optional min timestamp is available with a boolean flag:

{noformat}
create table test (id text primary key) with compaction={
    'max_sstable_age_days': '1',
    'base_time_seconds': '3600',
    'min_threshold': '4',
    'class': 'DateTieredCompactionStrategy',
    'use_min_timestamp': 'true'
};
{noformat}

This decreases risk for existing users, but adds functionality for people who are migrating from STCS and/or have other reasons to prefer getMinTimestamp().

What do you guys think, [~krummas] and [~Bj0rn]?

> Use sstable min timestamp when deciding if an sstable should be included in DTCS compactions
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8340
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8340
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Eriksson
>            Priority: Minor
>         Attachments: 8340.diff
>
>
> Currently we check how old the newest data (max timestamp) in an sstable is when we check if it should be compacted.
> If we instead switch to using min timestamp for this we have a pretty clean migration path from STCS/LCS to DTCS.
> My thinking is that before migrating, the user does a major compaction, which creates a huge sstable containing all data, with a min timestamp very far back in time. Then, after switching to DTCS, we will have a big sstable that we never compact (i.e., the min timestamp of this big sstable is before max_sstable_age_days), all newer data will come after it, and that new data will be properly compacted.
> WDYT [~Bj0rn]?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
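
For readers of the archive, the age filter discussed in this issue can be sketched roughly as below. This is a minimal illustration only, not the attached 8340.diff and not Cassandra's actual code: the classes {{SSTableInfo}} and {{AgeFilterSketch}}, the method signature, and the assumption that timestamps are microseconds since the epoch are all hypothetical. It shows how switching the comparison from max timestamp to min timestamp excludes both a freshly major-compacted sstable and an sstable whose max timestamp is pathologically high.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the timestamp metadata of one sstable.
// Timestamps are ASSUMED to be microseconds since the epoch.
final class SSTableInfo {
    final String name;
    final long minTimestampMicros;
    final long maxTimestampMicros;

    SSTableInfo(String name, long minTimestampMicros, long maxTimestampMicros) {
        this.name = name;
        this.minTimestampMicros = minTimestampMicros;
        this.maxTimestampMicros = maxTimestampMicros;
    }
}

// Hypothetical sketch of the age filter; not the real DTCS implementation.
final class AgeFilterSketch {
    // Returns the sstables that are still young enough to be compaction candidates.
    // useMinTimestamp mirrors the proposed 'use_min_timestamp' option.
    static List<SSTableInfo> filterOldSSTables(List<SSTableInfo> sstables,
                                               long nowMicros,
                                               int maxSSTableAgeDays,
                                               boolean useMinTimestamp) {
        long cutoff = nowMicros - TimeUnit.DAYS.toMicros(maxSSTableAgeDays);
        List<SSTableInfo> candidates = new ArrayList<>();
        for (SSTableInfo sstable : sstables) {
            long timestamp = useMinTimestamp
                    ? sstable.minTimestampMicros
                    : sstable.maxTimestampMicros;
            // Keep the sstable only if the chosen timestamp is inside the age window.
            if (timestamp >= cutoff) {
                candidates.add(sstable);
            }
        }
        return candidates;
    }
}
```

With the flag off, an sstable produced by a major compaction is still a candidate because its newest cell is recent; with the flag on, its old min timestamp takes it out of consideration, which is the migration path described in the issue.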