cassandra-commits mailing list archives

From "Oleg Anastasyev (JIRA)" <j...@apache.org>
Subject [jira] Commented: (CASSANDRA-1658) support incremental sstable switching
Date Wed, 24 Nov 2010 08:14:15 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935246#action_12935246 ]

Oleg Anastasyev commented on CASSANDRA-1658:
--------------------------------------------

I'd like to propose a slightly modified approach to the switch-over: it could be better to
switch reads over to the new sstable incrementally, not after compaction has completed but
during compaction. Since compaction writes rows in token order, it is easy to determine
whether a row with a given key has already been written to the new sstable. Once a row is
written to the new sstable it never changes, so it can be read from the new sstable like
any other row.
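
The routing decision could be sketched roughly as follows. This is only an illustration in Python, not Cassandra code: CompactionProgress and choose_source are hypothetical names, tokens are modelled as plain integers, and token collisions between different keys are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompactionProgress:
    """Highest token already written to the new sstable (hypothetical type)."""
    last_written_token: Optional[int] = None  # None until the first row is written

    def mark_written(self, token: int) -> None:
        # Compaction emits rows in ascending token order, so progress only moves forward.
        if self.last_written_token is not None and token <= self.last_written_token:
            raise ValueError("compaction must write rows in increasing token order")
        self.last_written_token = token

    def is_in_new_sstable(self, token: int) -> bool:
        # Everything at or below the last written token has already been compacted.
        return self.last_written_token is not None and token <= self.last_written_token


def choose_source(token: int, progress: CompactionProgress) -> str:
    """Return "new" if the read can be served from the new sstable, else "old"."""
    return "new" if progress.is_in_new_sstable(token) else "old"
```

A read for a token that compaction has not reached yet keeps going to the old sstables, so they would have to stay readable until compaction finishes.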

OK, I agree this implementation is more complex, but it has a number of advantages:
# We don't need to keep the storage occupied after compaction has completed.
# A just-written row is more likely to reside in the buffer cache, so reads from it are 1)
cheaper and 2) prevent the OS from purging hot read blocks of the new sstable from the buffer cache.
# In addition, if we limit the speed of compaction (say, to no more than 10% of disk I/O utilisation),
we can avoid disk spikes completely without even employing the direct I/O write approach. My reasoning
is that a constantly low load spread over time is much better for overall system stability
than occasional spikes of disk activity AND read latency. So, ideally, compaction
with incremental switch-over should be tuned to run slowly and continuously: as soon as one compaction
ends, another starts.
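
The throttling idea in the last point could look roughly like this. It is a minimal sketch under stated assumptions: the cap is expressed as bytes per second (a stand-in for "10% of disk I/O utilisation", which would have to be measured separately), and throttle_delay and write_throttled are hypothetical names rather than existing Cassandra functions.

```python
import time


def throttle_delay(bytes_written: int, elapsed_s: float, max_bytes_per_s: float) -> float:
    """Seconds to sleep so the average write rate stays at or below the cap."""
    if max_bytes_per_s <= 0:
        raise ValueError("max_bytes_per_s must be positive")
    # Minimum time that writing this many bytes is allowed to take under the cap.
    min_elapsed_s = bytes_written / max_bytes_per_s
    return max(0.0, min_elapsed_s - elapsed_s)


def write_throttled(chunks, write, max_bytes_per_s,
                    clock=time.monotonic, sleep=time.sleep) -> int:
    """Write chunks through write(), pausing whenever the average rate exceeds the cap."""
    start = clock()
    written = 0
    for chunk in chunks:
        write(chunk)
        written += len(chunk)
        delay = throttle_delay(written, clock() - start, max_bytes_per_s)
        if delay > 0:
            sleep(delay)
    return written
```

The clock and sleep parameters are injectable only so the pacing can be checked without real waiting.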


> support incremental sstable switching
> -------------------------------------
>
>                 Key: CASSANDRA-1658
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1658
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>            Priority: Minor
>
> I have been thinking about how to minimize the impact of compaction further beyond CASSANDRA-1470.
> 1470 deals with the impact of the compaction process itself in that it avoids going through
> the buffer cache; however, once compaction is complete you are still switching to new sstables,
> which will imply cold reads.
> Instead of switching all at once, one could keep both the old and new sstables around
> for a bit and incrementally switch over traffic to the new sstables.
> A given request would go to the new or old sstable depending on e.g. the hash of the
> row key coupled with the point in time relative to compaction completion and relative to the
> intended target sstable switch-over.
> In terms of end-user configuration/mnemonics, one would specify, for a given column family,
> something like "sstable transition period per gb of data" or similar. The "per gb of data"
> would refer to the size of the newly written sstable after a compaction. So, for a major compaction
> you would wait for a very significant period of time since the entire database just went cold.
> For a minor compaction, you would only wait for a short period of time.
> The result should be a reasonable negative impact on e.g. disk space usage, but hopefully
> a very significant impact in terms of making the sstable transition as smooth as possible
> for the node.
> I like this because it feels pretty simple, does not rely on OS-specific features or
> on specific support from the OS other than a "well functioning cache mechanism",
> and does not imply something hugely significant like writing our own page cache layer. The
> performance impact w.r.t. CPU should be very small, but the improvement in terms of disk I/O should
> be very significant for workloads where it matters.
> The feature would be optional and per-sstable (or possibly global for the node).
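
For illustration, the hash-plus-time routing described in the quoted issue might be sketched like this. The names transition_seconds and use_new_sstable, the CRC32 hash, and the linear ramp over the transition window are all assumptions made for the example; the issue does not specify any of them.

```python
import zlib


def transition_seconds(new_sstable_bytes: int, seconds_per_gb: float) -> float:
    """Transition window scaled by the size of the newly written sstable."""
    return (new_sstable_bytes / 2**30) * seconds_per_gb


def use_new_sstable(row_key: bytes, now: float, completed_at: float,
                    transition_s: float) -> bool:
    """Decide whether a read for row_key should already go to the new sstable."""
    if transition_s <= 0 or now >= completed_at + transition_s:
        return True   # window over (or disabled): all traffic uses the new sstable
    if now <= completed_at:
        return False  # window has not started yet
    # Fraction of the window that has elapsed, strictly between 0 and 1 here.
    fraction = (now - completed_at) / transition_s
    # Map the key to a stable position in [0, 1); a key switches once and stays switched.
    bucket = zlib.crc32(row_key) / 2**32
    return bucket < fraction
```

Because each key's position is fixed and the elapsed fraction only grows, the share of keys served from the new sstable rises gradually from none to all over the window.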

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

