db-derby-dev mailing list archives

From Suresh Thalamati <suresh.thalam...@gmail.com>
Subject Re: [jira] Commented: (DERBY-96) partial log record writes that occur because of out-of order writes need to be handled by recovery.
Date Fri, 11 Feb 2005 20:40:49 GMT

I looked at the log buffer code
(org.apache.derby.impl.store.raw.log.LogAccessFile.java) to implement
the checksum logic.
What I learned is that the checksum log record cannot simply be written
at the beginning of the buffer in all cases.
The reason is that a log record does not always fit into a single
buffer with the current default buffer size (32K).
With a 32K data page size, a log record can be 2*32K plus some
overhead. The current logic solves this problem by breaking the log
record into logical units (length, instant, log record info, optional
data) and writing them into more than one buffer; if a unit of write
does not fit into a single buffer, it is written directly to the log
file.

Possible ways to write the checksum log record seem to be:
1) Instead of having fixed-size log buffers, dynamically increase the
log buffer to the size of the log record when a log record does not fit
into a single buffer.
    The maximum size of a buffer will be 2*32K + a few bytes of
overhead. In the worst case, instead of three 32K buffers there will be
3 ~96K.
    With this approach a log record will always fit into a single
buffer, so the checksum log record can be written at the beginning of
the buffer.
    One issue with this approach is that an extra memory copy is
required to copy the log records into the log buffers, which is
currently avoided by writing directly to the log file.

2) Instead of writing the checksum for each buffer, calculate a checksum
for a group of buffers when they are being written to disk, and write
the checksum log record before writing the log buffers' contents.
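
To make option 2 concrete, here is a minimal sketch of the idea, not the
LogAccessFile code. It assumes java.util.zip.CRC32 as the checksum and a
made-up checksum record layout (an int data length followed by a long
checksum); the class and method names are hypothetical and only for
illustration.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.zip.CRC32;

class GroupChecksumSketch {
    // Hypothetical checksum record: [data length : int][checksum : long]
    static final int HEADER = Integer.BYTES + Long.BYTES;

    // Computes one checksum over a group of buffers and emits the
    // checksum record ahead of the buffer contents.
    static byte[] writeGroup(List<byte[]> buffers) {
        CRC32 crc = new CRC32();
        int total = 0;
        for (byte[] b : buffers) {
            crc.update(b, 0, b.length);
            total += b.length;
        }
        ByteBuffer out = ByteBuffer.allocate(HEADER + total);
        out.putInt(total);
        out.putLong(crc.getValue());
        for (byte[] b : buffers) {
            out.put(b);
        }
        return out.array();
    }

    // What recovery would do: read the checksum record, then recompute
    // the checksum over the bytes that follow and compare.
    static boolean verifyGroup(byte[] onDisk) {
        if (onDisk.length < HEADER) {
            return false;
        }
        ByteBuffer in = ByteBuffer.wrap(onDisk);
        int total = in.getInt();
        long expected = in.getLong();
        if (total < 0 || in.remaining() < total) {
            return false; // the data after the checksum record is short
        }
        CRC32 crc = new CRC32();
        crc.update(onDisk, HEADER, total);
        return crc.getValue() == expected;
    }
}
```

Since the checksum record precedes the data it covers, recovery can
treat a mismatch as the end of the valid log instead of replaying a
partially written record.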

Any comments/suggestions?


Suresh Thalamati (JIRA) wrote:

>     [ http://issues.apache.org/jira/browse/DERBY-96?page=comments#action_59041 ]
>Suresh Thalamati commented on DERBY-96:
>Conclusion was to solve this problem by writing a checksum log record before writing
>the log buffer and verifying the checksum during recovery.
>I don't know how to link a derby-dev list e-mail to JIRA, so I am just
>copying and pasting the comments from the e-mail list.
>Mike Matrigali wrote:
>>>I think that some fix to this issue should be implemented for the next
>>>release.  The order of my preference is #2, #1, #3.
>I believe option #2 (checksumming log records in the log buffers before
>writing to the disk)  is a good fix for this problem.
>If there are no objections to this approach,  I will start to work on
>>>I think that the option #2 can be implemented in the logging system and
>>>require very little if no changes to the rest of the system processing
>>>of log records.  Log record offsets remain efficient, ie. they can use
>>>LSN's directly.  Only the boot time recovery code need look for the
>>>new log record and do the work to verify checksums, online abort is
>>>I would like to see some performance numbers on the checksum overhead
>>>and if it is measurable then maybe some discussion on checksum choice.
>>>An obvious first choice would seem to be the standard java provided one
>>>used on the data pages.  If I had it to do over, I would probably have
>>>used a different approach on the data pages.  The point of the checksum
>>>on the data page is not to catch data sector write errors, the system
>>>expects the device to catch those, the only point is to catch
>>>inconsistent sector writes (ie. 1st and 2nd 512 byte sector but not
>>>3rd and 4th), for this the current checksum is overkill.  For this one
>>>need not checksum every byte on the page,
>>>one can guarantee a consistent write with 1 bit per sector in the page.
>>>In the future we may want to revisit #3 if it looks like the stream log
>>>is an I/O bottleneck which can't be addressed by striping or some other
>>>hardware help like smart caching controllers.  I see it as a performance
>>>project rather than a correctness project.  It also is a lot more work
>>>and risk.  Note that this could be a good project for someone wanting to
>>>do some research in this area as it is implemented as a derby module
>>>where an alternate implementation could be dropped in if available.
>>>While I believe that we should address this issue, I should also note
>>>that in all my time working on cloudscape/derby I have never received a
>>>problem database (in that time any log related error would have come
>>>through me), that resulted from this out of order/incomplete log
>>>write issue - this of course does not mean it has not happened just that
>>>it was not reported to us and/or did not affect the database in a
>>>noticeable way.  We have actually never seen an out of order write from
>>>the data pages also - we have seen a few checksum errors but all of
>>>those were caused by a bad disk.
>>>On the upgrade issue, it may be time to start an upgrade thread.  Here
>>>are just some thoughts.  If doing option #2, it would be nice if the
>>>new code could still read the old log files and then optionally
>>>write the new log record or not.  Then if users wanted to run a
>>>release in a "soft" upgrade mode where they needed to be able to
>>>go back to the old software they could - they just would not get
>>>this fix.  On a "hard" upgrade the software should continue to read
>>>the old log files as they are currently formatted, and for any new
>>>log files it should begin writing the new log record.  Once the new
>>>log record makes its way into the log file, accessing the db with the
>>>old software is unsupported (it will throw an error as it won't know
>>>what to do with the new log record).
>>partial log record writes that occur because of out-of order writes need to be handled
>>by recovery.
>>         Key: DERBY-96
>>         URL: http://issues.apache.org/jira/browse/DERBY-96
>>     Project: Derby
>>        Type: New Feature
>>  Components: Store
>>    Versions:
>>    Reporter: Suresh Thalamati
>>    Assignee: Suresh Thalamati
>>An incomplete log record write that occurs because of
>>out-of-order partial writes gets recognized as complete during
>>recovery if the first sector and last sector happen to get written.
>> The current system recognizes incompletely written log records by checking
>>the length of the record that is stored at the beginning and end.
>> The format in which log records are written to disk is:
>>  +----------+--------------+----------+
>>  |  length  |  LOG RECORD  |  length  |
>>  +----------+--------------+----------+
>>This mechanism works fine if sectors are written sequentially or the
>>log record size is less than 2 sectors. I believe that on SCSI-type disks
>>the order is not necessarily sequential; SCSI disk drives may sometimes
>>reorder the sectors to optimize performance.  If a log record
>>that spans multiple disk sectors is being written to a SCSI type of
>>device, it is possible that only the first and last sectors are written
>>before the crash. If this occurs, the recovery system will incorrectly
>>conclude that the log record was completely written and replay the
>>record. This could lead to recovery errors or data corruption.
>>This problem also will not occur if a disk drive has a write cache with a
>>battery backup, which will make sure the I/O request completes.
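
For illustration, the failure described in the issue above can be
reproduced with a small sketch. It assumes a 512 byte sector, a 4 byte
length field, and stale sectors that read back as zeros; the class and
method names are hypothetical and this is not the Derby recovery code.

```java
import java.nio.ByteBuffer;

class TornWriteSketch {
    static final int SECTOR = 512; // assumed sector size

    // Encodes a record as | length | LOG RECORD | length |.
    static byte[] encode(byte[] payload) {
        ByteBuffer b = ByteBuffer.allocate(4 + payload.length + 4);
        b.putInt(payload.length).put(payload).putInt(payload.length);
        return b.array();
    }

    // The length-based completeness check: the leading and trailing
    // lengths must match.
    static boolean looksComplete(byte[] disk) {
        if (disk.length < 8) {
            return false;
        }
        ByteBuffer b = ByteBuffer.wrap(disk);
        int len = b.getInt(0);
        if (len < 0 || len > disk.length - 8) {
            return false;
        }
        return b.getInt(4 + len) == len;
    }

    // Simulates a crash after only the first and last sectors of the
    // record reached the disk; the sectors in between stay zeroed.
    static byte[] tornWrite(byte[] record) {
        byte[] disk = new byte[record.length];
        int first = Math.min(SECTOR, record.length);
        System.arraycopy(record, 0, disk, 0, first);
        int lastStart = ((record.length - 1) / SECTOR) * SECTOR;
        System.arraycopy(record, lastStart, disk, lastStart,
                         record.length - lastStart);
        return disk;
    }
}
```

With a 2000 byte payload the record spans four sectors, and
looksComplete(tornWrite(encode(payload))) returns true even though the
two middle sectors never reached the disk.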
