Return-Path: X-Original-To: apmail-cassandra-commits-archive@www.apache.org Delivered-To: apmail-cassandra-commits-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 6F8A71085C for ; Tue, 28 Jan 2014 15:51:40 +0000 (UTC) Received: (qmail 56737 invoked by uid 500); 28 Jan 2014 15:51:40 -0000 Delivered-To: apmail-cassandra-commits-archive@cassandra.apache.org Received: (qmail 56523 invoked by uid 500); 28 Jan 2014 15:51:39 -0000 Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@cassandra.apache.org Delivered-To: mailing list commits@cassandra.apache.org Received: (qmail 56512 invoked by uid 99); 28 Jan 2014 15:51:39 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 28 Jan 2014 15:51:39 +0000 Date: Tue, 28 Jan 2014 15:51:39 +0000 (UTC) From: "Jonathan Ellis (JIRA)" To: commits@cassandra.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (CASSANDRA-6474) Compaction strategy based on MinHash MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/CASSANDRA-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884266#comment-13884266 ] Jonathan Ellis commented on CASSANDRA-6474: ------------------------------------------- Over in 6216 I said {quote} we should prefer smallest write amplification where wa = (overlapped size-minus-tombstones + candidate size-minus-tombstones) / candidate size-minus-tombstones {quote} With HLL we could actually compute write amplification directly instead of guessing based on size. > Compaction strategy based on MinHash > ------------------------------------ > > Key: CASSANDRA-6474 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6474 > Project: Cassandra > Issue Type: New Feature > Components: Core > Reporter: Yuki Morishita > Assignee: sankalp kohli > Labels: compaction > Fix For: 3.0 > > > We can consider an SSTable as a set of partition keys, and 'compaction' as de-duplication of those partition keys. > We want to find compaction candidates from SSTables that have as many same keys as possible. If we can group similar SSTables based on some measurement, we can achieve more efficient compaction. > One such measurement is [Jaccard Distance|http://en.wikipedia.org/wiki/Jaccard_index], > !http://upload.wikimedia.org/math/1/8/6/186c7f4e83da32e889d606140fae25a0.png! > which we can estimate using technique called [MinHash|http://en.wikipedia.org/wiki/MinHash]. > In Cassandra, we can calculate and store MinHash signature when writing SSTable. New compaction strategy uses the signature to find the group of similar SSTable for compaction candidates. We can always fall back to STCS when such candidates are not exists. > This is just an idea floating around my head, but before I forget, I dump it here. For introduction to this technique, [Chapter 3 of 'Mining of Massive Datasets'|http://infolab.stanford.edu/~ullman/mmds/ch3.pdf] is a good start. -- This message was sent by Atlassian JIRA (v6.1.5#6160)