cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6474) Compaction strategy based on MinHash
Date Tue, 28 Jan 2014 02:02:41 GMT


Benedict commented on CASSANDRA-6474:

[~yukim] It sounds like you're only aiming to select the best compaction candidates, rather
than potentially avoiding compacting files that can be said to be (mostly) non-intersecting.
Wouldn't the latter be even more useful, especially for use cases where data is mostly appending
to unique PartitionKeys?

Also, it seems for this we could supplement whatever data we retain for the initial calculation
with a sampling of one of the files (which, given Murmur3 gives a good spread of data could
probably be optimised to scanning a percentage of pages, rather than random records) to give
tighter bounds on the actual overlap, in the case we decide not to compact two files for this

> Compaction strategy based on MinHash
> ------------------------------------
>                 Key: CASSANDRA-6474
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Yuki Morishita
>            Assignee: sankalp kohli
>              Labels: compaction
>             Fix For: 3.0
> We can consider an SSTable as a set of partition keys, and 'compaction' as de-duplication
of those partition keys.
> We want to find compaction candidates from SSTables that have as many same keys as possible.
If we can group similar SSTables based on some measurement, we can achieve more efficient
> One such measurement is [Jaccard Distance|],
> !!
> which we can estimate using technique called [MinHash|].
> In Cassandra, we can calculate and store MinHash signature when writing SSTable. New
compaction strategy uses the signature to find the group of similar SSTable for compaction
candidates. We can always fall back to STCS when such candidates are not exists.
> This is just an idea floating around my head, but before I forget, I dump it here. For
introduction to this technique, [Chapter 3 of 'Mining of Massive Datasets'|]
is a good start.

This message was sent by Atlassian JIRA

View raw message