cassandra-commits mailing list archives

From "Wojciech Meler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-2901) Allow taking advantage of multiple cores while compacting a single CF
Date Fri, 15 Jul 2011 10:52:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065851#comment-13065851 ]

Wojciech Meler commented on CASSANDRA-2901:
-------------------------------------------

Maybe it would be nice to spawn a separate compaction process?
Compaction is quite a GC-intensive operation, so maybe it makes sense to separate it from the server.
It would also be nice to have a CLI tool that compacts files without a running Cassandra server,
for backup purposes. Why not spawn such a tool from the server?

> Allow taking advantage of multiple cores while compacting a single CF
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-2901
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2901
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Priority: Minor
>
> Moved from CASSANDRA-1876:
> There are five stages: read, deserialize, merge, serialize, and write. We probably want
> to continue doing read+deserialize and serialize+write together, or you waste a lot on copying
> to/from buffers.
> So, what I would suggest is: one thread per input sstable doing read + deserialize (a
> row at a time), one thread merging corresponding rows from each input sstable, and one thread
> doing serialize + write of the output. This should give us between a 2x and 3x speedup (depending
> on how much we save by doing the merge on a different thread than the write).
> This will require roughly 2x the memory, to allow the reader threads to work ahead of
> the merge stage. (I.e., for each input sstable you will have up to one row in a queue waiting
> to be merged, and the reader thread working on the next.) Seems quite reasonable on that front.
> Multithreaded compaction should be either on or off. It doesn't make sense to try to
> do things halfway (by doing the reads with a threadpool whose size you can grow/shrink,
> for instance): we still have compaction threads tuned to low priority, by default, so the
> impact on the rest of the system won't be very different. Nor do we expect to have so many
> input sstables that we lose a lot in context switching between reader threads. (If this is
> a concern, we already have a tunable to limit the number of sstables merged at a time in a
> single CF.)
> IMO it's acceptable to punt completely on rows that are larger than memory, and fall
> back to the old non-parallel code there. I don't see any sane way to parallelize large-row
> compactions.
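
For illustration only, the thread layout proposed in the description above (one reader
per input sstable, one merger, one writer, with at most one row queued per input) could be
sketched roughly as below. This is not Cassandra code. The class name ParallelCompactionSketch,
the Row type, the EOF marker, the use of key-sorted in-memory lists in place of sstables, and
the "later input wins" merge rule are all assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCompactionSketch {
    // Hypothetical row type: real rows carry columns and timestamps.
    static final class Row {
        final String key, value;
        Row(String key, String value) { this.key = key; this.value = value; }
    }

    // End-of-stream marker, compared by identity.
    private static final Row EOF = new Row(null, null);

    // Reader stage: one thread per input, standing in for read + deserialize.
    static BlockingQueue<Row> startReader(List<Row> sstable, ExecutorService pool) {
        // Capacity 1: one row waits in the queue while the reader works on the next.
        BlockingQueue<Row> queue = new ArrayBlockingQueue<>(1);
        pool.submit(() -> {
            for (Row row : sstable) queue.put(row);
            queue.put(EOF);
            return null;
        });
        return queue;
    }

    // Each input list must be sorted by key, as rows in an sstable are.
    static List<Row> compact(List<List<Row>> sstables) throws Exception {
        // One thread per reader, plus one merger and one writer.
        ExecutorService pool = Executors.newFixedThreadPool(sstables.size() + 2);
        try {
            List<BlockingQueue<Row>> inputs = new ArrayList<>();
            for (List<Row> s : sstables) inputs.add(startReader(s, pool));
            BlockingQueue<Row> merged = new ArrayBlockingQueue<>(1);

            // Merge stage: combine rows that share the smallest pending key.
            Future<Void> merger = pool.submit(() -> {
                Row[] heads = new Row[inputs.size()];
                for (int i = 0; i < heads.length; i++) heads[i] = inputs.get(i).take();
                while (true) {
                    String min = null;
                    for (Row h : heads)
                        if (h != EOF && (min == null || h.key.compareTo(min) < 0)) min = h.key;
                    if (min == null) break; // every input is exhausted
                    Row winner = null;
                    for (int i = 0; i < heads.length; i++) {
                        if (heads[i] != EOF && heads[i].key.equals(min)) {
                            winner = heads[i]; // assumption: later inputs are newer and win
                            heads[i] = inputs.get(i).take();
                        }
                    }
                    merged.put(winner);
                }
                merged.put(EOF);
                return null;
            });

            // Write stage: stands in for serialize + write of the output.
            Future<List<Row>> writer = pool.submit(() -> {
                List<Row> out = new ArrayList<>();
                for (Row r = merged.take(); r != EOF; r = merged.take()) out.add(r);
                return out;
            });

            merger.get();
            return writer.get();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<Row>> sstables = List.of(
            List.of(new Row("a", "old"), new Row("c", "1")),
            List.of(new Row("a", "new"), new Row("b", "2")));
        for (Row r : compact(sstables)) System.out.println(r.key + "=" + r.value);
    }
}
```

The bounded queues are what cap the extra memory: each reader can run at most one row ahead of
the merger. A row too large to hold in memory would not fit this scheme, which matches the
suggestion to fall back to the non-parallel path in that case.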

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
