incubator-cassandra-dev mailing list archives

From Piotr Kołaczkowski <pkola...@ii.pw.edu.pl>
Subject Re: Ticket CASSANDRA-3578 - Multithreaded CommitLog
Date Thu, 08 Dec 2011 16:44:48 GMT

On 2011-12-08 08:40, Jonathan Ellis wrote:
> 2011/12/8 Piotr Kołaczkowski<pkolaczk@ii.pw.edu.pl>:
>> Right, this would be the best option to have an ability to write into
>> multiple log files, put on multiple disks. I'm not sure if it is part of
>> that ticket, though.
> It's not.  I don't think anyone needs more than 80MB/s or so of
> commitlog bandwidth for a while.
>
>> BTW: I'm not so sure if multiple, parallel writes to a memory mapped file
>> would be actually slower or faster than sequential writes. I think the OS
>> would optimise the writes so that physically they would be sequential, or
>> even delay them until fsync (or low cached disk buffers), so no performance
>> loss would occur
> Right.  What we're trying to fix here is having a single thread doing
> the copying + checksumming being a bottleneck.  The i/o pattern should
> stay more or less the same.
>

Thanks for the explanation. This is exactly what I understood from the
ticket. Calculating the serialized size twice also looks like a waste of
CPU to me (or am I wrong, and it is calculated only once?).
The longer I think about this ticket, the more questions I have.

Can someone tell me what the usage pattern of the CommitLog#add method is?
I mean, is it possible that a single thread calls add many times,
keeps the returned Future objects, and *then* waits on all or some of
them? Or is it always: add, then wait (until the Future is ready),
add, wait, add, wait...? If the former is true, then we would benefit
from returning the Future objects as early as possible, without
performing any heavy calculations in the add method, and from making the
code parallel on the output side of the queue, by using some kind of
thread pool executor (or by changing the current commit log executors to
have more than one worker thread). Then, even if a single thread writes
many RowMutations to the CommitLog, the CRC and copying would still be
parallel and fast. What do you think? Does it make sense? In the future,
such an architecture could be extended to support many log files on
separate disks :)
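
To make the "add many times, then wait" pattern concrete, here is a minimal
sketch. It assumes a hypothetical SketchCommitLog class (not Cassandra's
actual CommitLog), treats a mutation as an already serialized byte array, and
uses an arbitrary pool size of 4. The add method returns its Future right away,
and the CRC and copy run on a pool thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.CRC32;

// Hypothetical stand-in for the commit log; not Cassandra's real CommitLog.
final class SketchCommitLog implements AutoCloseable {
    private final ExecutorService workers;

    SketchCommitLog(int workerThreads) {
        this.workers = Executors.newFixedThreadPool(workerThreads);
    }

    // Returns immediately; the checksum and the copy happen on a pool thread.
    Future<ByteBuffer> add(byte[] serializedMutation) {
        return workers.submit(() -> {
            CRC32 crc = new CRC32();
            crc.update(serializedMutation, 0, serializedMutation.length);
            // Entry layout (an assumption for this sketch): payload, then 8-byte CRC.
            ByteBuffer entry = ByteBuffer.allocate(serializedMutation.length + Long.BYTES);
            entry.put(serializedMutation).putLong(crc.getValue());
            entry.flip();
            return entry;
        });
    }

    @Override
    public void close() {
        workers.shutdown();
    }

    // The first usage pattern: add everything, keep the Futures, then wait on all of them.
    static int addManyThenWait(int count) {
        try (SketchCommitLog log = new SketchCommitLog(4)) {
            List<Future<ByteBuffer>> pending = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                byte[] mutation = ("mutation-" + i).getBytes(StandardCharsets.UTF_8);
                pending.add(log.add(mutation));
            }
            int completed = 0;
            for (Future<ByteBuffer> future : pending) {
                // get() blocks until this entry has been checksummed and copied.
                if (future.get().remaining() > Long.BYTES) {
                    completed++;
                }
            }
            return completed;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the second pattern (add, wait, add, wait...) the same caller would block
on each Future in turn, so a single caller thread would gain nothing from the
pool.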


To summarize:

The current architecture:
many threads (calc. size) -> queue -> one thread (calc. size,
serialize, CRC, allocate, copy, fsync)

My 1st proposal:
many threads (calc. size, serialize, CRC) -> queue -> one thread 
(allocate, copy, fsync)

My 2nd proposal:
many threads (calc. size, allocate, serialize, CRC, copy) -> queue -> 
one thread (fsync)

My 3rd proposal:
many threads (calc. size, allocate, serialize directly into buffer, CRC) 
-> queue -> one thread (fsync)

My 4th proposal:
many threads (no op) -> queue -> n threads, where n = number of cores  
(calc. size, allocate, serialize, CRC, copy) -> queue -> one thread (fsync)
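
A rough sketch of how the 4th proposal could be wired together, under several
assumptions: the class and method names are hypothetical, a heap ByteBuffer
stands in for the log segment, a counter increment stands in for fsync, and an
AtomicInteger hands out offsets as the "allocate" step. It is only meant to
show the two stages, not how Cassandra does or should implement them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.zip.CRC32;

// Hypothetical sketch of the 4th proposal; all names are illustrative only.
final class TwoStagePipelineSketch {
    // Upper bound on entry size, used only to size the stand-in segment.
    private static final int SLOT_BYTES = 64;

    static int run(int mutations) {
        int n = Runtime.getRuntime().availableProcessors();
        ExecutorService workers = Executors.newFixedThreadPool(n);    // stage 1: n threads
        ExecutorService syncer = Executors.newSingleThreadExecutor(); // stage 2: one thread
        ByteBuffer segment = ByteBuffer.allocate(mutations * SLOT_BYTES);
        AtomicInteger nextFree = new AtomicInteger();
        AtomicInteger synced = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(mutations);
        try {
            for (int i = 0; i < mutations; i++) {
                final int id = i;
                // The caller does no heavy work; it only enqueues the mutation.
                workers.execute(() -> {
                    byte[] payload = ("mutation-" + id).getBytes(StandardCharsets.UTF_8); // serialize
                    int size = payload.length + Long.BYTES;                               // calc. size
                    int offset = nextFree.getAndAdd(size);                                // allocate
                    CRC32 crc = new CRC32();
                    crc.update(payload, 0, payload.length);                               // CRC
                    // Each worker writes to its own region through an independent view.
                    ByteBuffer slot = segment.duplicate();
                    slot.position(offset);
                    slot.put(payload).putLong(crc.getValue());                            // copy
                    // Hand over to the single sync thread (stand-in for fsync).
                    syncer.execute(() -> {
                        synced.incrementAndGet();
                        done.countDown();
                    });
                });
            }
            done.await();
            return synced.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            workers.shutdown();
            syncer.shutdown();
        }
    }
}
```

Calling TwoStagePipelineSketch.run(100) pushes 100 entries through both
stages and returns how many reached the sync thread. A real sync stage would
presumably batch many entries per fsync rather than handle them one by one.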

Which one do you like the most?

-- 
Piotr Kołaczkowski
Instytut Informatyki, Politechnika Warszawska
Nowowiejska 15/19, 00-665 Warszawa
e-mail: pkolaczk@ii.pw.edu.pl
www: http://home.elka.pw.edu.pl/~pkolaczk/

