incubator-cassandra-dev mailing list archives

From Piotr Kołaczkowski <>
Subject Re: Ticket CASSANDRA-3578 - Multithreaded CommitLog
Date Thu, 08 Dec 2011 07:33:34 GMT

Right, that would be the best option: the ability to write into 
multiple log files placed on multiple disks. I'm not sure it is part of 
that ticket, though. Maybe we should split it into two things: parallel 
serialization / CRC here, and parallel writes to multiple logfiles (as 
another ticket). It looks like a major commitlog refactoring, including 
touching the logfile segment management and logfile recovery code.

BTW: I'm not so sure whether multiple parallel writes to a memory-mapped 
file would actually be slower or faster than sequential writes. I think 
the OS would optimise the writes so that physically they would be 
sequential, or even delay them until fsync (or until cached disk buffers 
run low), so no performance loss would occur, while moving data from a 
temporary array into shared buffer memory would actually be faster (and 
the possibility of avoiding temporary arrays entirely by serializing 
directly into the shared buffer is also promising). I think we should 
benchmark / profile this first (I can do it) and see how it behaves in 
reality, unless someone has already done that. If you are interested, I 
can find some time this evening to do it.
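[Editor's note: a rough sketch of the access pattern such a benchmark would exercise — several threads writing disjoint regions of one memory-mapped file through independent duplicate() views. All names here are hypothetical and this is not Cassandra code, just the pattern whose cost relative to sequential writes is in question.]

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch: N threads each fill a disjoint region of one
// memory-mapped file. Each thread gets a duplicate() view so that buffer
// positions are private and the threads never interfere with each other.
class MappedParallelWrite {
    static void writeParallel(File f, int regions, int regionSize) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0,
                                          (long) regions * regionSize);
            Thread[] threads = new Thread[regions];
            for (int i = 0; i < regions; i++) {
                final int region = i;
                threads[i] = new Thread(() -> {
                    ByteBuffer view = map.duplicate(); // private position/limit
                    view.position(region * regionSize);
                    for (int b = 0; b < regionSize; b++)
                        view.put((byte) region);       // fill our region only
                });
                threads[i].start();
            }
            for (Thread t : threads)
                t.join();
            map.force(); // the fsync step the mail discusses deferring
        }
    }
}
```

Timing sequential writes against this parallel variant (with the force() call excluded from the measured interval) would answer the question above.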

On 2011-12-07 21:57, Jeremiah Jordan wrote:
> Another option is to have multiple threads reading from the queue and 
> writing to their own commit log files.  If you have multiple commit 
> log directories with each having its own task writing to it, you can 
> keep the "only sequential writes" optimization.  Multiple writers to 
> one disk only make sense if you are using an SSD for storage; otherwise 
> the writes are no longer purely sequential, which would slow down 
> the writing.
> On 12/07/2011 10:56 AM, Piotr Kołaczkowski wrote:
>> Hello everyone,
>> As an interview task I've got to make CommitLog multithreaded. I'm 
>> new to Cassandra project and therefore, before I start modifying 
>> code, I have to make sure I understand what is going on there correctly.
>> Feel free to correct anything I got wrong or partially wrong.
>> 1. The CommitLog singleton object is responsible for receiving 
>> RowMutation objects via its add method. The add method is thread-safe 
>> and is intended to be called by many threads adding their RowMutations 
>> independently.
>> 2. Each invocation of CommitLog#add  puts a new task onto the queue. 
>> This task is represented by LogRecordAdder callable object, which is 
>> responsible for actually calling the CommitLogSegment#write method 
>> for doing all the "hard work" of serializing the RowMutation object, 
>> calculating CRC and writing that to the memory mapped 
>> CommitLogSegment file buffer. The add method immediately returns a 
>> Future object, which can be waited for (if needed) - it will block 
>> until the row mutation is saved to the log file and (optionally) synced.
>> 3. The queued tasks are processed one-by-one, sequentially by the 
>> appropriate ICommitLogExecutorService. This service also controls 
>> syncing the active memory mapped segments. There are two sync 
>> strategies available: periodic and batched. The periodic strategy 
>> simply calls sync periodically by asynchronously putting an 
>> appropriate sync task into the queue, in between the LogRecordAdder 
>> tasks. The LogRecordAdder tasks are "done" as soon as they are written 
>> to the log, so the caller *won't wait* for the sync. On the other 
>> hand, the batched strategy (BatchCommitLogExecutorService) performs 
>> the tasks in batches, each batch finishing with a sync operation. The 
>> tasks are marked as done *after* the sync operation has finished. This 
>> deferred task marking is achieved thanks to the CheaterFutureTask 
>> class, which allows running the task without immediately marking the 
>> FutureTask as done. Nice. :)
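[Editor's note: the batched strategy described above can be sketched roughly as follows. The names are hypothetical, not the actual Cassandra classes, and CompletableFuture stands in for what CheaterFutureTask achieves: the work runs first, but callers unblock only after the batch's sync.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of batched syncing: drain a batch of queued writes,
// execute them, sync once, and only then complete the callers' futures.
class BatchSyncSketch {
    static class Write {
        final Runnable work;                       // appends bytes to the log
        final CompletableFuture<Void> done = new CompletableFuture<>();
        Write(Runnable work) { this.work = work; }
    }

    final BlockingQueue<Write> queue = new ArrayBlockingQueue<>(1024);

    void processOneBatch(Runnable sync) {
        List<Write> batch = new ArrayList<>();
        queue.drainTo(batch);
        for (Write w : batch)
            w.work.run();          // write to the (memory-mapped) log
        sync.run();                // one fsync for the whole batch
        for (Write w : batch)
            w.done.complete(null); // callers unblock only after the sync
    }
}
```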
>> 4. The serialized size of the RowMutation object is calculated twice: 
>> once before submitting to the ExecutorService - to check that it is 
>> not larger than the segment size - and again after being taken from 
>> the queue for execution - to check whether it fits into the active 
>> CommitLogSegment and, if it doesn't, to activate a new 
>> CommitLogSegment. This looks to me like a point needing optimisation; 
>> I couldn't find any code caching the serialized size to avoid 
>> computing it twice.
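[Editor's note: one possible fix for the double computation, sketched with hypothetical names — memoize the size inside the mutation on first computation, so the post-queue check merely reads it back.]

```java
// Hypothetical sketch: cache the serialized size inside the mutation so it
// is computed once (before queueing) and only read back on execution.
class SizeCachingMutation {
    private final byte[] payload;   // stands in for the real RowMutation fields
    private int cachedSize = -1;    // -1 means "not computed yet"
    int computeCount = 0;           // exposed only to demonstrate the caching

    SizeCachingMutation(byte[] payload) { this.payload = payload; }

    int serializedSize() {
        if (cachedSize < 0)
            cachedSize = computeSize(); // the expensive serialization walk
        return cachedSize;
    }

    private int computeSize() {
        computeCount++;
        return payload.length + 8;  // placeholder: length prefix + checksum
    }
}
```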
>> 5. The serialization, CRC calculation and actual commit log writes 
>> are happening sequentially. The aim of this ticket is to make it 
>> parallel.
>> Questions:
>> 1. What happens to the recovery, if the power goes off before the log 
>> has been synced, and it has been written partially (e.g. it is 
>> truncated in the middle of the RowMutation data)? Are incomplete 
>> RowMutation writes detected only by means of CRC (CommitLog around 
>> lines 237-240), or is there some other mechanism for it?
>> 2. Is the CommitLog#add method allowed to do some heavier 
>> computations? What is the contract for it? Does it have to return 
>> immediately or can I move some code into it?
>> Solutions I consider (please comment):
>> 1. Moving the serialized size calculation, serialization and CRC 
>> calculation totally before the executor service queue, so that these 
>> operations would be parallel, and performed once per RowMutation 
>> object. The calculated size / data array / CRC value would be 
>> appended to the task and put into the queue. Then copying that into 
>> the commit log would proceed sequentially - the task would contain 
>> only code for log writing. This is the safest and easiest solution, 
>> but also the least performant, because copying is still sequential 
>> and still might be a bottleneck. The logic of allocating new commit 
>> log segments and syncing remains unchanged.
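[Editor's note: solution 1 could look roughly like this, with hypothetical names — the calling threads serialize and checksum in parallel, and the queued task carries only the finished bytes plus CRC for the single writer thread to copy sequentially.]

```java
import java.util.zip.CRC32;

// Hypothetical sketch of solution 1: serialization and CRC happen on the
// calling threads, in parallel; the executor queue carries finished bytes
// only, so the sequential writer just copies them into the log.
class PreserializedWrite {
    final byte[] data; // already-serialized RowMutation bytes
    final long crc;    // checksum computed before queueing

    PreserializedWrite(byte[] serialized) {
        this.data = serialized;
        CRC32 checksum = new CRC32();
        checksum.update(serialized, 0, serialized.length);
        this.crc = checksum.getValue();
    }
}
```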
>> 2. Moving the serialized size calculation, serialization, CRC 
>> calculation *and commit log writing* before the executor service 
>> queue. This raises immediately some problems / questions:
>> a) The code for segment allocation needs to be changed, as it becomes 
>> multithreaded. It can be done using AtomicInteger.compareAndSet, so 
>> that each RowMutation gets its own, non-overlapping piece of commit 
>> log to write into.
>> b) What happens if there is not enough free space in the current 
>> active segment? Do we allow more than one active segment at once? Or 
>> do we restrict the parallelism to writing into just a single active 
>> segment? (I don't like the latter, as it would certainly be less 
>> performant: we would have to wait for the current active segment to 
>> be finished before we could start a new one.)
>> c) Is the recovery method ready for reading a partially written 
>> (invalid) RowMutation that is not the last mutation in the commit 
>> log? If we allow writing several row mutations in parallel, it has to be.
>> d) The tasks are sent to the queue only for wait-for-sync 
>> functionality - they would not contain any code to execute, because 
>> everything would be already done.
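[Editor's note: point a) above — each mutation reserving its own non-overlapping slice of the segment via compareAndSet — might look like this minimal sketch. The names are hypothetical, not the actual Cassandra code.]

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: lock-free reservation of non-overlapping regions of
// a commit log segment, so multiple writer threads never collide.
class SegmentAllocator {
    private final AtomicInteger position = new AtomicInteger(0);
    private final int capacity;

    SegmentAllocator(int capacity) { this.capacity = capacity; }

    /** Returns the start offset of a reserved region of the given size,
     *  or -1 if the segment is full and a new one must be activated. */
    int allocate(int size) {
        while (true) {
            int current = position.get();
            int next = current + size;
            if (next > capacity)
                return -1; // caller must switch to a fresh segment
            if (position.compareAndSet(current, next))
                return current; // region [current, current+size) is ours
        }
    }
}
```

On CAS failure the loop simply retries with the updated position, so no locks are needed and each reservation stays disjoint.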
>> 3. Everything just as in 2., but with the addition that the 
>> serialization code writes directly into the target memory-mapped 
>> buffer and not into a temporary byte array. This would save us the 
>> copying and also put less strain on the GC.
>> Sorry for such a long e-mail, and best regards,
>> Piotr Kolaczkowski

Piotr Kołaczkowski
Instytut Informatyki, Politechnika Warszawska
Nowowiejska 15/19, 00-665 Warszawa
