directory-dev mailing list archives

From Selcuk AYA <>
Subject Re: [Txn Layer] WAL flush questions
Date Mon, 19 Mar 2012 17:59:09 GMT
On Mon, Mar 19, 2012 at 10:41 AM, Emmanuel Lécharny <> wrote:
> On 3/19/12 6:26 PM, Selcuk AYA wrote:
>> On Mon, Mar 19, 2012 at 9:24 AM, Emmanuel Lécharny <> wrote:
>>> Hi,
>>> I have a few questions about the handling of the log buffer.
>>> When we can't write any more data into the buffer because it's full, we
>>> try to flush the buffer to disk. What happens then is:
>>> - if there is enough room remaining in the buffer, we write a skip record
>>> (with a -1 length): is it necessary? (we then rewind the buffer)
>>> - otherwise, we rewind the buffer
>>> In any case, we increment the writeAheadRewindCount: what for?
>>> Then we call the flush() method, which will be executed only if no other
>>> thread is already flushing the buffer (just in case the sync() method is
>>> called by another thread). I guess this is intended to allow a thread to
>>> add new data to the buffer while another thread writes the buffer to disk?
>>> So AFAIU, only one thread will be allowed to write data into the buffer,
>>> up to the point it reaches a record being held by the flush thread, and
>>> only one thread can flush the data, up to the point it reaches the last
>>> record it can write (which is computed before the flush() method is
>>> called).
>>> I'm wondering if we couldn't use a simpler algorithm, where we have a
>>> flush thread used to flush the data in any case. If the buffer is full,
>>> we stop writing until we are signaled that there is some room left (and
>>> it is the flush thread's role to signal the writer that it can start
>>> again). That means we write as much as we can, signaling each record to
>>> the flush thread, and the flush thread will consume the records as they
>>> arrive. If both are colliding (i.e., no more room remains in the buffer),
>>> the reader will have to wait for the writer to wake it up. We won't need
>>> to use a buffer at all: we just pass the records (plus their headers and
>>> trailers) in a queue, avoiding a copy into temporary memory.
>>> This is basically doing the same thing, but we don't wait until the
>>> buffer is full to wake up the writer. This is the way the network layer
>>> works in NIO, with a selector signaling the writer thread when it's ready
>>> to accept some more data to be written.
>> I am confused about the buffering (or no buffering) you suggest. Are
>> you suggesting that a flush thread will write directly off the user's
>> buffer without any in-memory copy?
> Yes. In fact, I suggest we buffer the records without copying them. When
> the flush thread is woken up (or kicked), it will write the header, the
> buffer, and the footer. We can use ByteBuffer gathering for that (see

I see. But this is effectively what we are doing, right? Instead of
putting the buffers in a queue and doing scatter/gather through byte
buffers (which, I think, will eventually do a memcpy to issue a single
batched write), we copy into an in-memory buffer and let the flushing
thread do the single batched write.
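
To make the comparison concrete, here is a minimal sketch of the gathering
alternative, using FileChannel.write(ByteBuffer[]) from java.nio. The record
layout (a 4-byte length header and a 4-byte length footer) and the names
GatheringWriteSketch and writeRecord are assumptions for illustration only,
not the actual txn-layer format or API.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class GatheringWriteSketch {
    /** Writes header, payload and footer with a gathering write; returns the bytes written. */
    static long writeRecord(FileChannel channel, byte[] payload) throws IOException {
        // Hypothetical framing: the payload length before and after the payload.
        ByteBuffer header = ByteBuffer.allocate(4);
        header.putInt(payload.length);
        header.flip();
        ByteBuffer footer = ByteBuffer.allocate(4);
        footer.putInt(payload.length);
        footer.flip();
        // wrap() does not copy: the buffer is a view over the caller's array.
        ByteBuffer body = ByteBuffer.wrap(payload);

        ByteBuffer[] parts = { header, body, footer };
        long expected = 8L + payload.length;
        long total = 0;
        // A single call may write fewer bytes than requested, so loop until done.
        while (total < expected) {
            total += channel.write(parts);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("wal-sketch", ".log");
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            long n = writeRecord(ch, "hello".getBytes(StandardCharsets.UTF_8));
            System.out.println(n + " bytes written");
        }
        Files.delete(file);
    }
}
```

Whether the JVM or the OS copies the three buffers into one contiguous area
before the write is platform dependent, which is the point in question above.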

>> Currently things work like this on the common code path:
>> * for user threads:
>> prepare record
>> get log latch
>> copy into the in-memory buffer and get the LSN (logical sequence number)
>> release log latch
>> return LSN
>> * for the background flushing thread:
>> wake up periodically, reap the in-memory log, and write
>> So the background thread does not necessarily wait for the buffer to be
>> full before it wakes up and writes. In the hopefully less common case, if
>> the buffer is full, a user thread will take one for the team and write the
>> buffer (we could signal the flush thread as an alternative here).
>> In the common case, this lets user threads avoid waiting for the write and
>> get an LSN quickly (the LSN is important for ordering log records), and it
>> allows batching of writes. Similar algorithms are used in all the database
>> WAL code I have looked at (including Apache Derby).
> I have something different in mind to get the records ordered: inject them
> into a queue (as only a single writer will access the queue, the order will
> be guaranteed). The flush thread will wait for this queue to be modified,
> then flush the data to disk. This queue can contain a limited number of
> records, and we can check that the record size does not exceed a certain
> amount.
> In any case, the flush thread is autonomous, and can either be woken up
> when the queue has some data, or wait to be woken up periodically, or when
> the queue is full.
> Does it make sense?
> Note: I'm not suggesting that we should change the current code, just
> trying to get some food for thought for later improvement...
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
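
For later reference, the queue-based idea quoted above could look something
like the sketch below. It assumes a bounded ArrayBlockingQueue of records and
a dedicated flush thread woken by the queue itself; QueueFlushSketch, submit,
close, and the ByteArrayOutputStream used in place of the log file are
hypothetical names chosen for illustration, and header/footer framing is left
out.

```java
import java.io.ByteArrayOutputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class QueueFlushSketch {
    private static final byte[] POISON = new byte[0]; // sentinel asking the flush thread to stop
    private final BlockingQueue<byte[]> queue;
    private final ByteArrayOutputStream disk = new ByteArrayOutputStream(); // stand-in for the file
    private final Thread flusher;

    QueueFlushSketch(int maxRecords) {
        queue = new ArrayBlockingQueue<>(maxRecords); // limited number of pending records
        flusher = new Thread(this::drain, "wal-flusher");
        flusher.start();
    }

    /** Writer side: hands over the record without copying it; blocks while the queue is full. */
    void submit(byte[] record) throws InterruptedException {
        queue.put(record);
    }

    /** Flush-thread side: woken whenever a record is available, writes in queue order. */
    private void drain() {
        try {
            while (true) {
                byte[] record = queue.take();
                if (record == POISON) {
                    return;
                }
                disk.write(record, 0, record.length);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Stops the flush thread after pending records are written and returns what reached "disk". */
    byte[] close() throws InterruptedException {
        queue.put(POISON);
        flusher.join();
        return disk.toByteArray();
    }

    public static void main(String[] args) throws InterruptedException {
        QueueFlushSketch wal = new QueueFlushSketch(2);
        wal.submit(new byte[] {1});
        wal.submit(new byte[] {2, 3});
        wal.submit(new byte[] {4});
        System.out.println(wal.close().length + " bytes written in submission order");
    }
}
```

With a single writer the queue order is the record order, as suggested above;
with several writers, the order would be whatever order put() calls complete
in, so an LSN would still have to be assigned at that point.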
