incubator-cassandra-user mailing list archives

From Russell Bradberry <rbradbe...@gmail.com>
Subject Re: Customized Compaction Strategy: Dev Questions
Date Wed, 04 Jun 2014 17:40:33 GMT
Maybe I’m misunderstanding something, but what makes you think that running a major compaction
every day will cause the data from January 1st to exist in only one SSTable, and that this
SSTable will not also contain data from other days? Are you talking about making a new compaction
strategy that creates SSTables by day?



On June 4, 2014 at 1:36:10 PM, Redmumba (redmumba@gmail.com) wrote:

Let's say I run a major compaction every day, so that the "oldest" sstable contains only the
data for January 1st.  Assuming all the nodes are in-sync and have had at least one repair
run before the table is dropped (so that all information for that time period is "the same"),
wouldn't it be safe to assume that the same data would be dropped on all nodes?  There might
be a period when the compaction is running where different nodes might have an inconsistent
view of just that day's data (in that some would have it and others would not), but the cluster
would still function and become eventually consistent, correct?

Also, if the entirety of the sstable is being dropped, wouldn't the tombstones be removed
with it?  I wouldn't be concerned with individual rows and columns, and this is a write-only
table, more or less--the only deletes that occur in the current system are to delete the old
data.


On Wed, Jun 4, 2014 at 10:24 AM, Russell Bradberry <rbradberry@gmail.com> wrote:
I’m not sure what you want to do is feasible.  At a high level I can see you running into
issues with RF etc.  SSTables are not identical from node to node, so if you drop a full SSTable
on one node there is no single corresponding SSTable on the adjacent nodes to drop.  You
would need to choose data to compact out, and ensure it is removed on all replicas as well.
But if your problem is that you’re low on disk space then you probably won’t be able
to write out a new SSTable with the older information compacted out. Also, there is more to
an SSTable than just data: the SSTable could have tombstones and other relics that haven’t
been cleaned up from nodes coming or going.




On June 4, 2014 at 1:10:58 PM, Redmumba (redmumba@gmail.com) wrote:

Thanks, Russell--yes, a similar concept, just applied to sstables.  I'm assuming this would
require changes to both major compactions, and probably GC (to remove the old tables), but
since I'm not super-familiar with the C* internals, I wanted to make sure it was feasible
with the current toolset before I actually dived in and started tinkering.

Andrew


On Wed, Jun 4, 2014 at 10:04 AM, Russell Bradberry <rbradberry@gmail.com> wrote:
hmm, I see. So something similar to Capped Collections in MongoDB.



On June 4, 2014 at 1:03:46 PM, Redmumba (redmumba@gmail.com) wrote:

Not quite; if I'm at say 90% disk usage, I'd like to drop the oldest sstable rather than simply
run out of space.
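
A minimal sketch of that rule, assuming per-sstable metadata is available as a plain record. `SSTableInfo`, `sstables_to_drop`, and the 0.90 default threshold are hypothetical names chosen for illustration and are not part of Cassandra. It only models the decision on a single node and does nothing to keep replicas consistent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTableInfo:
    """Hypothetical stand-in for per-sstable metadata; not a Cassandra class."""
    name: str
    max_timestamp_micros: int  # newest write timestamp held in the sstable
    size_bytes: int


def sstables_to_drop(sstables, used_bytes, capacity_bytes, threshold=0.90):
    """Pick sstables oldest-first until disk usage is at or below the threshold."""
    target = threshold * capacity_bytes
    dropped = []
    for table in sorted(sstables, key=lambda s: s.max_timestamp_micros):
        if used_bytes <= target:
            break  # enough space has been reclaimed
        dropped.append(table)
        used_bytes -= table.size_bytes
    return dropped
```

Nothing is selected while usage is already at or below the threshold, so the rule only acts when space is actually running out.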

The problem with using TTLs is that I have to try and guess how much data is being put in--since
this is auditing data, the usage can vary wildly depending on time of year, verbosity of auditing,
etc.  I'd like to maximize the disk space--not optimize the cleanup process.

Andrew


On Wed, Jun 4, 2014 at 9:47 AM, Russell Bradberry <rbradberry@gmail.com> wrote:
You mean this:

https://issues.apache.org/jira/browse/CASSANDRA-5228

?



On June 4, 2014 at 12:42:33 PM, Redmumba (redmumba@gmail.com) wrote:

Good morning!

I've asked (and seen other people ask) about the ability to drop old sstables, basically creating
a FIFO-like clean-up process.  Since we're using Cassandra as an auditing system, this is
particularly appealing to us because it means we can maximize the amount of auditing data
we can keep while still allowing Cassandra to clear old data automatically.

My idea is this: perform compaction based on the range of dates available in the sstable (or
just metadata about when it was created).  For example, a major compaction could create a
combined sstable per day--so that, say, 60 days of data after a major compaction would be
held in 60 sstables.
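
The grouping step could be sketched roughly as follows, assuming each sstable exposes the timestamp of its newest write in microseconds. `SSTableInfo`, `bucket_by_day`, and `compaction_candidates` are hypothetical names for illustration, not Cassandra classes or methods, and a real strategy would plug into AbstractCompactionStrategy rather than operate on plain records.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SSTableInfo:
    """Hypothetical stand-in for per-sstable metadata; not a Cassandra class."""
    name: str
    max_timestamp_micros: int  # newest write timestamp held in the sstable
    size_bytes: int


def bucket_by_day(sstables):
    """Group sstables by the UTC calendar day of their newest write."""
    buckets = defaultdict(list)
    for table in sstables:
        seconds = table.max_timestamp_micros / 1_000_000
        day = datetime.fromtimestamp(seconds, tz=timezone.utc).date()
        buckets[day].append(table)
    return dict(buckets)


def compaction_candidates(sstables):
    """Return only the days that hold more than one sstable and so need merging."""
    return {
        day: group
        for day, group in bucket_by_day(sstables).items()
        if len(group) > 1
    }
```

Bucketing on the newest timestamp is one possible choice; an sstable whose writes span midnight would still land in a single bucket, so this does not by itself guarantee that each sstable holds exactly one day of data.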

My question then is, will this be possible by simply implementing a separate AbstractCompactionStrategy?
Does this sound feasible at all?  Based on the implementation of Size and Leveled strategies,
it looks like I would have the ability to control what and how things get compacted, but I
wanted to verify before putting time into it.

Thank you so much for your time!

Andrew



