Yes, some sort of data structure to coordinate this could reduce the problem as well. 
I made some comments on that at the end of 2558.

I believe a coordinator could be in place both to 
- plan the start of compaction 
and 
- to coordinate compaction thread shutdown and tmp file deletion before we completely run out of disk space

Regards, 
Terje

On Wed, May 4, 2011 at 10:09 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
Or we could "reserve" space when starting a compaction.

On Wed, May 4, 2011 at 2:32 AM, Terje Marthinussen
<tmarthinussen@gmail.com> wrote:
> Partially, I guess this may be a side effect of multithreaded compactions?
> Before running out of space completely, I do see a few of these:
>  WARN [CompactionExecutor:448] 2011-05-02 01:08:10,480
> CompactionManager.java (line 516) insufficient space to compact all
> requested files SSTableReader(path='/data/cassandra/JP_MALL_P
> H/Order-f-12858-Data.db'),
> SSTableReader(path='/data/cassandra/JP_MALL_PH/Order-f-12851-Data.db'),
> SSTableReader(path='/data/cassandra/JP_MALL_PH/Order-f-12864-Data.db')
>  INFO [CompactionExecutor:448] 2011-05-02 01:08:10,481 StorageService.java
> (line 2066) requesting GC to free disk space
> In this case, there would be 24 threads that asked if there was empty disk
> space.
> Most of them probably succeeded in that request, but they could have
> requested 24x available space in theory since I do not think there is any
> global pool of used disk in place that manages which how much disk space
> will be needed for already started compactions?
> Of course, regardless how much checking there is in advance, we could still
> run out of disk, so I guess there is also a need for checking if diskspace
> is about to run out while compaction runs so things may be halted/aborted.
> Unfortunately that would need global coordination so we do not stop all
> compaction threads....
> After reducing to 6 compaction threads in 0.8 beta2, the data has compacted
> just fine without any disk space issues, so I guess another problem you may
> hit as you get a lot of sstables which have updates (that is, duplicates) to
> the same data, is that of course, the massively concurrent compaction taking
> place with nproc threads could also concurrently duplicate all the
> duplicates on a large scale.
> Yes, this is in favour of multithreaded compaction as it should normally
> help keeping sstables to a sane level and avoid such problems, but it is
> unfortunately just a kludge to the real problem which is to segment the
> sstables somehow on keyspace so we can get down the disk requirements and
> recover from scenarios where disk gets above 50%.
> Regards,
> Terje
>
>
> On Wed, May 4, 2011 at 3:33 PM, Terje Marthinussen <tmarthinussen@gmail.com>
> wrote:
>>
>> Well, just did not look at these logs very well at all last night
>> First out of disk message:
>> ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
>> AbstractCassandraDaemon.java (line 112) Fatal exception in thread
>> Thread[CompactionExecutor:387,1,main]
>> java.io.IOException: No space left on device
>> Then finally the last one
>> ERROR [FlushWriter:128] 2011-05-02 01:51:06,112
>> AbstractCassandraDaemon.java (line 112) Fatal exception in thread
>> Thread[FlushWriter:128,5,main]
>> java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
>> space to flush 554962 bytes
>>         at
>> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>         at java.lang.Thread.run(Thread.java:662)
>> Caused by: java.lang.RuntimeException: Insufficient disk space to flush
>> 554962 bytes
>>         at
>> org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
>>         at
>> org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
>>         at
>> org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
>>         at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
>>         at
>> org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
>>         at
>> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>>         ... 3 more
>>  INFO [CompactionExecutor:451] 2011-05-02 01:51:06,113 StorageService.java
>> (line 2066) requesting GC to free disk space
>> [lots of sstables deleted]
>> After this is starts running again (although not fine it seems).
>> So the disk seems to have been full for 35 minutes due to un-deleted
>> sstables.
>> Terje
>> On Wed, May 4, 2011 at 6:34 AM, Terje Marthinussen
>> <tmarthinussen@gmail.com> wrote:
>>>
>>> Hm... peculiar.
>>> Post flush is not involved in compactions, right?
>>>
>>> May 2nd
>>> 01:06 - Out of disk
>>> 01:51 - Starts a mix of major and minor compactions on different column
>>> families
>>> It then starts a few minor compactions extra over the day, but given that
>>> there are more than 1000 sstables, and we are talking 3 minor compactions
>>> started, it is not normal I think.
>>> May 3rd 1 minor compaction started.
>>> When I checked today, there was a bunch of tmp files on the disk with
>>> last modify time from 01:something on may 2nd and 200GB empty disk...
>>> Definitely no compaction going on.
>>> Guess I will add some debug logging and see if I get lucky and run out of
>>> disk again.
>>> Terje
>>> On Wed, May 4, 2011 at 5:06 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>>>>
>>>> Compaction does, but flush didn't until
>>>> https://issues.apache.org/jira/browse/CASSANDRA-2404
>>>>
>>>> On Tue, May 3, 2011 at 2:26 PM, Terje Marthinussen
>>>> <tmarthinussen@gmail.com> wrote:
>>>> > Yes, I realize that.
>>>> > I am bit curious why it ran out of disk, or rather, why I have 200GB
>>>> > empty
>>>> > disk now, but unfortunately it seems like we may not have had
>>>> > monitoring
>>>> > enabled on this node to tell me what happened in terms of disk usage.
>>>> > I also thought that compaction was supposed to resume (try again with
>>>> > less
>>>> > data) if it fails?
>>>> > Terje
>>>> >
>>>> > On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis <jbellis@gmail.com>
>>>> > wrote:
>>>> >>
>>>> >> post flusher is responsible for updating commitlog header after a
>>>> >> flush; each task waits for a specific flush to complete, then does
>>>> >> its
>>>> >> thing.
>>>> >>
>>>> >> so when you had a flush catastrophically fail, its corresponding
>>>> >> post-flush task will be stuck.
>>>> >>
>>>> >> On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen
>>>> >> <tmarthinussen@gmail.com> wrote:
>>>> >> > Just some very tiny amount of writes in the background here (some
>>>> >> > hints
>>>> >> > spooled up on another node slowly coming in).
>>>> >> > No new data.
>>>> >> >
>>>> >> > I thought there was no exceptions, but I did not look far enough
>>>> >> > back in
>>>> >> > the
>>>> >> > log at first.
>>>> >> > Going back a bit further now however, I see that about 50 hours
>>>> >> > ago:
>>>> >> > ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
>>>> >> > AbstractCassandraDaemon.java (line 112) Fatal exception in thread
>>>> >> > Thread[CompactionExecutor:387,1,main]
>>>> >> > java.io.IOException: No space left on device
>>>> >> >         at java.io.RandomAccessFile.writeBytes(Native Method)
>>>> >> >         at
>>>> >> > java.io.RandomAccessFile.write(RandomAccessFile.java:466)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> >> > org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> >> > org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> >> > org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> >> > org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
>>>> >> >         at
>>>> >> >
>>>> >> > org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> >> > org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> >> > org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> >> > org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> >> > org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
>>>> >> >         at
>>>> >> > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>> >> >         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> >> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> >> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>> >> >         at java.lang.Thread.run(Thread.java:662)
>>>> >> > [followed by a few more of those...]
>>>> >> > and then a bunch of these:
>>>> >> > ERROR [FlushWriter:123] 2011-05-02 01:21:12,690
>>>> >> > AbstractCassandraDaemon.java
>>>> >> > (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main]
>>>> >> > java.lang.RuntimeException: java.lang.RuntimeException:
>>>> >> > Insufficient
>>>> >> > disk
>>>> >> > space to flush 40009184 bytes
>>>> >> >         at
>>>> >> >
>>>> >> > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> >> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> >> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>> >> >         at java.lang.Thread.run(Thread.java:662)
>>>> >> > Caused by: java.lang.RuntimeException: Insufficient disk space to
>>>> >> > flush
>>>> >> > 40009184 bytes
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> >> > org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
>>>> >> >         at
>>>> >> >
>>>> >> >
>>>> >> > org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
>>>> >> >         at
>>>> >> >
>>>> >> > org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
>>>> >> >         at
>>>> >> > org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
>>>> >> >         at
>>>> >> > org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
>>>> >> >         at
>>>> >> >
>>>> >> > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>>>> >> >         ... 3 more
>>>> >> > Seems like compactions stopped after this (a bunch of tmp tables
>>>> >> > there
>>>> >> > still
>>>> >> > from when those errors where generated), and I can only suspect the
>>>> >> > post
>>>> >> > flusher may have stopped at the same time.
>>>> >> > There is 890GB of disk for data, sstables are currently using 604G
>>>> >> > (139GB is
>>>> >> > old tmp tables from when it ran out of disk) and "ring" tells me
>>>> >> > the
>>>> >> > load on
>>>> >> > the node is 313GB.
>>>> >> > Terje
>>>> >> >
>>>> >> >
>>>> >> > On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis <jbellis@gmail.com>
>>>> >> > wrote:
>>>> >> >>
>>>> >> >> ... and are there any exceptions in the log?
>>>> >> >>
>>>> >> >> On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis <jbellis@gmail.com>
>>>> >> >> wrote:
>>>> >> >> > Does it resolve down to 0 eventually if you stop doing writes?
>>>> >> >> >
>>>> >> >> > On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
>>>> >> >> > <tmarthinussen@gmail.com> wrote:
>>>> >> >> >> Cassandra 0.8 beta trunk from about 1 week ago:
>>>> >> >> >> Pool Name                    Active   Pending      Completed
>>>> >> >> >> ReadStage                         0         0              5
>>>> >> >> >> RequestResponseStage              0         0          87129
>>>> >> >> >> MutationStage                     0         0         187298
>>>> >> >> >> ReadRepairStage                   0         0              0
>>>> >> >> >> ReplicateOnWriteStage             0         0              0
>>>> >> >> >> GossipStage                       0         0        1353524
>>>> >> >> >> AntiEntropyStage                  0         0              0
>>>> >> >> >> MigrationStage                    0         0             10
>>>> >> >> >> MemtablePostFlusher               1       190            108
>>>> >> >> >> StreamStage                       0         0              0
>>>> >> >> >> FlushWriter                       0         0            302
>>>> >> >> >> FILEUTILS-DELETE-POOL             0         0             26
>>>> >> >> >> MiscStage                         0         0              0
>>>> >> >> >> FlushSorter                       0         0              0
>>>> >> >> >> InternalResponseStage             0         0              0
>>>> >> >> >> HintedHandoff                     1         4              7
>>>> >> >> >>
>>>> >> >> >> Anyone with nice theories about the pending value on the
>>>> >> >> >> memtable
>>>> >> >> >> post
>>>> >> >> >> flusher?
>>>> >> >> >> Regards,
>>>> >> >> >> Terje
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > --
>>>> >> >> > Jonathan Ellis
>>>> >> >> > Project Chair, Apache Cassandra
>>>> >> >> > co-founder of DataStax, the source for professional Cassandra
>>>> >> >> > support
>>>> >> >> > http://www.datastax.com
>>>> >> >> >
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> --
>>>> >> >> Jonathan Ellis
>>>> >> >> Project Chair, Apache Cassandra
>>>> >> >> co-founder of DataStax, the source for professional Cassandra
>>>> >> >> support
>>>> >> >> http://www.datastax.com
>>>> >> >
>>>> >> >
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Jonathan Ellis
>>>> >> Project Chair, Apache Cassandra
>>>> >> co-founder of DataStax, the source for professional Cassandra support
>>>> >> http://www.datastax.com
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Jonathan Ellis
>>>> Project Chair, Apache Cassandra
>>>> co-founder of DataStax, the source for professional Cassandra support
>>>> http://www.datastax.com
>>>
>>
>
>



--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com