directory-dev mailing list archives

From Emmanuel Lécharny <>
Subject Re: [Txn Layer] WAL flush questions
Date Mon, 19 Mar 2012 22:13:32 GMT
On 3/19/12 7:38 PM, Selcuk AYA wrote:
> On Mon, Mar 19, 2012 at 11:32 AM, Emmanuel Lécharny<>  wrote:
>> On 3/19/12 6:59 PM, Selcuk AYA wrote:
>>> On Mon, Mar 19, 2012 at 10:41 AM, Emmanuel Lécharny<>
>>>   wrote:
>>>> On 3/19/12 6:26 PM, Selcuk AYA wrote:
>>>>> On Mon, Mar 19, 2012 at 9:24 AM, Emmanuel Lécharny<>
>>>>>   wrote:
>>>>>> Hi,
>>>>>> I have a few questions about the handling of the log buffer.
>>>>>> When we can't write any more data into the buffer because it's full, we
>>>>>> try to flush the buffer to disk. What happens then is:
>>>>>> - if there is enough room remaining in the buffer, we write a skip
>>>>>> record (with a -1 length): is it necessary? (we then rewind the buffer)
>>>>>> - otherwise, we rewind the buffer
>>>>>> In any case, we increment the writeAheadRewindCount: what for?
>>>>>> Then we call the flush() method, which will be executed only if no
>>>>>> other thread is already flushing the buffer (just in case the sync()
>>>>>> method is called by another thread). I guess this is intended to allow
>>>>>> a thread to add new data to the buffer while another thread writes the
>>>>>> buffer to disk.
>>>>>> So AFAIU, only one thread will be allowed to write data into the
>>>>>> buffer, up to the point it reaches a record being held by the flush
>>>>>> thread, and only one thread can flush the data, up to the point it
>>>>>> reaches the last record it can write (which is computed before the
>>>>>> flush() method is called).
>>>>>> I'm wondering if we couldn't use a simpler algorithm, where we have a
>>>>>> flush thread used to flush the data in any case. If the buffer is full,
>>>>>> we stop writing until we are signaled that there is some room left (and
>>>>>> it is the flush thread's role to signal the writer that it can start
>>>>>> again). That means we write as much as we can, signaling each record to
>>>>>> the flush thread, and the flush thread will consume the records when
>>>>>> they arrive. If both are colliding (ie, no more room remains in the
>>>>>> buffer, the reader will have to wait for the writer to wake it up). We
>>>>>> won't need to use a buffer at all, we just pass the records (plus their
>>>>>> headers and trailers) in a queue, avoiding a copy into temporary memory.
>>>>>> This is basically doing the same thing, but we don't wait until the
>>>>>> buffer is full to wake up the writer. This is the way the network layer
>>>>>> works in NIO, with a selector signaling the writer thread when it's
>>>>>> ready to accept some more data to be written.
>>>>> I am confused about the buffering (or no buffering) you suggest. Are
>>>>> you suggesting the flush thread will write directly from the user's
>>>>> buffer without any in-memory copy?
>>>> Yes. In fact, I suggest we buffer the records, without copying them. When
>>>> the flush thread is woken up (or kicked), it will write the header, the
>>>> buffer, the footer. We can use ByteBuffer gathering for that (see
>>> I see. But this is effectively what we are doing, right? Instead of
>>> putting the buffers in a queue and doing scatter/gather through byte
>>> buffers (which will eventually do a memcpy to do a single batched write,
>>> I think), we copy into an in-memory buffer and let the flushing thread
>>> do the single batched write.
>> Yes, but you copy the user records into a temporary ByteBuffer, which will
>> be read and flushed. If you put the user records in a queue, you don't need
>> this extra copy, plus you don't need to allocate a 4 MB buffer at all. That
>> does not mean you won't consume those 4 MB if the queue is not emptied fast
>> enough by the flush thread, but in the general case, you just end up using
>> less memory if the flush thread is awakened when some data is present in
>> the queue.
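
To make the queue idea above a bit more concrete, here is a rough sketch of
what it could look like. It is only an illustration, not the code in the Txn
Layer: the class name QueueFlushSketch, the STOP marker and the capacity
parameter are all assumptions of mine. Writers hand over {header, userRecord,
footer} triples without copying them, a bounded queue makes them block when
the flush thread falls behind, and the flush thread writes each triple with a
gathering write.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class QueueFlushSketch {
    // Hypothetical end-of-stream marker, compared by identity.
    private static final ByteBuffer[] STOP = new ByteBuffer[0];

    private final BlockingQueue<ByteBuffer[]> queue;
    private final FileChannel channel;
    private final Thread flusher = new Thread(this::drain, "wal-flusher");
    private volatile IOException failure;

    QueueFlushSketch(FileChannel channel, int capacity) {
        this.channel = channel;
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    void start() {
        flusher.start();
    }

    // Called by writers: no copy is made, and put() blocks while the queue is full.
    void append(ByteBuffer header, ByteBuffer userRecord, ByteBuffer footer)
            throws InterruptedException {
        queue.put(new ByteBuffer[] { header, userRecord, footer });
    }

    // Flush thread body: take each record as soon as it arrives and write it out.
    private void drain() {
        try {
            ByteBuffer[] parts;
            while ((parts = queue.take()) != STOP) {
                // A gathering write may be partial, so repeat until all parts are drained.
                while (hasRemaining(parts)) {
                    channel.write(parts, 0, parts.length);
                }
            }
        } catch (IOException e) {
            // Sketch only: a real implementation would also have to unblock waiting writers.
            failure = e;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static boolean hasRemaining(ByteBuffer[] parts) {
        for (ByteBuffer part : parts) {
            if (part.hasRemaining()) {
                return true;
            }
        }
        return false;
    }

    // Waits for the queued records to be written, then reports any I/O failure.
    void close() throws InterruptedException, IOException {
        queue.put(STOP);
        flusher.join();
        if (failure != null) {
            throw failure;
        }
    }
}
```

With a small capacity the memory held by the queue stays bounded by the
records in flight, which is the point made above about not allocating the
whole buffer up front.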
> So we want to write a batched write to the end of the log using a "single"
> IO. What I am asking is: won't the Java byte buffer implementation
> have to internally copy the buffers into a single buffer and do a
> single batched write from that buffer?
As long as the userRecord is already held in a ByteBuffer, there is
no need to copy it into another buffer. We simply use
FileChannel.write(ByteBuffer[], offset, length) to write the buffers to
disk. Here, the ByteBuffer[] will contain the header, the userRecord and
the footer. Internally, the operation will use direct buffers instead of
heap buffers, and we have no control over how the write is done.

The thing is to avoid doing an extra buffer copy.
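
As a minimal sketch of the gathering write described above: the framing used
here (a 4-byte length as header and as footer) and the names
GatheringWriteSketch and writeRecord are assumptions for illustration only,
not the actual log format. The only API relied upon is
FileChannel.write(ByteBuffer[], int, int).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

class GatheringWriteSketch {
    // Writes header + userRecord + footer without copying the user record
    // into an intermediate buffer. Returns the number of bytes written.
    static long writeRecord(FileChannel channel, ByteBuffer userRecord) throws IOException {
        int length = userRecord.remaining();

        // Hypothetical framing: the record length before and after the payload.
        ByteBuffer header = ByteBuffer.allocate(4).putInt(length);
        header.flip();
        ByteBuffer footer = ByteBuffer.allocate(4).putInt(length);
        footer.flip();

        ByteBuffer[] parts = { header, userRecord, footer };
        long written = 0;
        // The buffers are drained in order, and a write may be partial,
        // so loop until the last buffer has been fully written.
        while (footer.hasRemaining()) {
            written += channel.write(parts, 0, parts.length);
        }
        return written;
    }
}
```

For a 5-byte user record this writes 13 bytes in total (4 + 5 + 4), and the
userRecord buffer itself is handed to the channel as is.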

Using MemoryMappedFile can also speed things up, as we can drain the
queue faster (the data will be written to memory instead of being
flushed to disk). Of course, if the computer does not have enough
memory, we will still be slowed down and the queue will grow... There is
no black magic here, we can just rely on what Java offers instead of
redefining everything.
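
For reference, a rough sketch of the memory-mapped variant, assuming that
"MemoryMappedFile" here means a MappedByteBuffer obtained through
FileChannel.map(). The names MappedLogSketch and appendMapped, the fixed
region size and the length-prefixed layout are made up for the example.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedLogSketch {
    // Maps the first regionSize bytes of the file and writes a
    // length-prefixed payload into the mapping. Returns the position
    // reached in the mapped region after the write.
    static int appendMapped(Path file, byte[] payload, int regionSize) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // READ_WRITE mapping: puts go to memory, and the OS writes them back later.
            MappedByteBuffer region = channel.map(FileChannel.MapMode.READ_WRITE, 0, regionSize);
            region.putInt(payload.length);
            region.put(payload);
            // force() asks for the mapped content to be written to the storage device;
            // a real log would decide when to call it, which is part of what must be measured.
            region.force();
            return region.position();
        }
    }
}
```

Whether this actually drains the queue faster than a plain FileChannel write
depends on how often force() has to be called, so it needs to be measured.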

Also keep in mind that Derby and other such software were written with
Java 1.3 in mind. MemoryMappedFile, FileChannel and the like
weren't around before Java 1.4, and the base of Derby was already written.

One last thing: all those suggestions *must* be evaluated. Until they
are compared with something that *works*, we don't have a baseline. What
you are currently building will become the baseline.

Thanks !

Emmanuel Lécharny
