db-derby-dev mailing list archives

From Olav Sandstaa <Olav.Sands...@Sun.COM>
Subject Re: Options for syncing of log to disk
Date Fri, 01 Sep 2006 11:59:36 GMT

Thanks for your thoughts on this; see my comments further down.

Mike Matrigali wrote:
> Initial writes to the log are to a preallocated log, but once it
> is filled it is possible for there to be writes that extend the log
> and thus it is not safe to not sync metadata that tracks the
> length of the file.

I agree that this is a problem that needs to be looked into if we 
consider changing from "rws" to "rwd" for the log files. My main 
purpose in sending out the performance numbers was to illustrate that 
there is a potential for performance improvements on some platforms, 
and to get feedback on which issues need to be looked into in order to 
take advantage of this.

> Unfortunately this behavior is hardware, OS and JVM specific, and
> the exact
> meaning of rws and rwd is left vague in the javadoc that I have read.
> The javadoc usually says syncs "metadata" but does not explain what
> metadata.  

I certainly agree that the javadoc is vague. For RandomAccessFile the 
javadoc says [1]:

 "The "rws" and "rwd" modes work much like the force(boolean) method of 
the FileChannel class, passing arguments of true and false, 
respectively, except that 
they always apply to every I/O operation and are therefore often more 
efficient. If the file resides on a local storage device then when an 
invocation of a method of this class returns it is guaranteed that all 
changes made to the file by that invocation will have been written to 
that device. This is useful for ensuring that critical information is 
not lost in the event of a system crash."

It is very unclear whether the last sentence refers only to files opened 
in "rws" mode or also holds for files opened in "rwd" mode.
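To make the comparison in the javadoc concrete, here is a minimal sketch of the two ways of asking for synchronous writes: opening the file in "rws" or "rwd" mode, and opening it in "rw" mode followed by an explicit FileChannel.force(true) or force(false). The class and method names (SyncModes, writeWithMode, writeWithForce) are made up for this illustration and are not Derby code. Whether the file length is actually durable after an "rwd" write or a force(false) call is exactly the open question, so the sketch only shows the calls, not the guarantees.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class SyncModes {

    // Opens the file in the given mode ("rws" or "rwd") and writes one record.
    // In these modes every write() is synchronous; no explicit sync call is made.
    static long writeWithMode(File file, String mode, byte[] record) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, mode)) {
            raf.write(record);
            return raf.length();
        }
    }

    // Opens the file in plain "rw" mode and syncs explicitly after the write.
    // force(true) asks for content and metadata, force(false) for content only.
    static long writeWithForce(File file, boolean metaData, byte[] record) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            FileChannel channel = raf.getChannel();
            ByteBuffer buffer = ByteBuffer.wrap(record);
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            channel.force(metaData);
            return channel.size();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("syncmodes", ".log");
        file.deleteOnExit();
        byte[] record = new byte[512];

        System.out.println("rws:          " + writeWithMode(file, "rws", record) + " bytes");
        System.out.println("rwd:          " + writeWithMode(file, "rwd", record) + " bytes");
        System.out.println("force(true):  " + writeWithForce(file, true, record) + " bytes");
        System.out.println("force(false): " + writeWithForce(file, false, record) + " bytes");
    }
}
```

The difference between the two styles is only in when the sync happens: with "rws"/"rwd" it is implied by every write, while with force() the caller decides.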

> When I worked on this issue for another db vendor, direct
> OS access usually provide 3 rather than 2 options.  The 3 options
> were:
> 1) no metadata sync
> 2) only sync file allocation metadata
> 3) sync file allocation metadata and other metadata.  The problem
>    is that other metadata includes the last modified time info
>    which is updated every write to the file.
> What do you mean by "most OS"?

Solaris and FreeBSD :-) I have also tried this on Linux 2.6 (Red Hat 
4.0?), but Linux seems to handle "rwd" the same as "rws" (or rather, 
files that the application asks to open with the O_DSYNC flag 
are actually opened with just O_SYNC).

The JavaDoc for RandomAccessFile also indicates that using "rws" needs 
two updates to the disk: "using "rws" requires updates to both the 
file's content and its metadata to be written, which generally requires 
at least one more low-level I/O operation." [1] Based on this I 
assumed that this would be a penalty on "most OS".

> What OS/JVM are your numbers from?

My numbers were from a machine running Solaris 10 x86 and Sun JVM 1.5.

> When the sync option on the log was switched from using full file
> sync to "rws" mode tests were run which I believe included linux 
> (probably only a single version of linux - not sure which) and
> XP with sun and ibm jvms (probably 1.4.2 as I think that was the latest
> JVM at the time), I think apple OS was also tested but I am not sure. 
> The first implementation simply switched
> to the "rws" mode but left the log file to grow as needed, "rws" mode
> was picked because it is impossible to tell if file allocation metadata
> is synced as part of "rwd" so in order to guarantee transaction
> consistency the safest mode was picked.  Tests were run which observed
> if we preallocated the log file then I/O to a preallocated file that
> did not extend the file only paid 1 I/O per sync.  So work was done
> to make most log I/O only happen to a preallocated file, but the logging
> system was not changed to guarantee all I/O was to a preallocated file.

Thanks for the background on how and why "rws" was selected. Based on 
your observations and the unclear semantics of "rwd" and metadata I 
agree that this was a good choice. Still, I think the extra cost 
of two disk operations per log write on "some OSs" is high 
and can make Derby perform worse than some of the other open-source 
databases on these OSs.
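As a rough illustration of the preallocation issue, the sketch below preallocates a file by writing zeros, syncs it once, and then picks the sync mode per write: "rwd" when the record fits inside the preallocated area, "rws" when the write would extend the file. This is only an assumption about one way the problem could be approached, not how Derby's logging system works, and it still relies on "rwd" being safe for writes that do not change the file length. The names (PreallocatedLog, preallocate, modeFor, writeRecord) are hypothetical, and reopening the file for every record is done only to keep the example short.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class PreallocatedLog {

    // Fills the file with zeros up to the given size and syncs it once, so that
    // later writes inside this area do not need to change the file length.
    static void preallocate(File file, long size) throws IOException {
        byte[] zeros = new byte[8192];
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            long remaining = size;
            while (remaining > 0) {
                int n = (int) Math.min(zeros.length, remaining);
                raf.write(zeros, 0, n);
                remaining -= n;
            }
            raf.getFD().sync();
        }
    }

    // Hypothetical policy: content-only sync while the write stays inside the
    // preallocated area, full sync as soon as the write would extend the file.
    static String modeFor(long fileLength, long position, int recordLength) {
        return position + recordLength <= fileLength ? "rwd" : "rws";
    }

    // Writes one record at the given position and returns the mode that was used.
    static String writeRecord(File file, long position, byte[] record) throws IOException {
        String mode = modeFor(file.length(), position, record.length);
        try (RandomAccessFile raf = new RandomAccessFile(file, mode)) {
            raf.seek(position);
            raf.write(record);
        }
        return mode;
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("prealloc", ".log");
        file.deleteOnExit();
        preallocate(file, 16384);

        byte[] record = new byte[512];
        System.out.println("inside preallocated area: " + writeRecord(file, 0, record));
        System.out.println("extending the file:       " + writeRecord(file, 16384 - 256, record));
    }
}
```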

> It is probably worth resurrecting the simple I/O test program, to let
> people run on their various JVM/OS combinations.  As has been noted in
> the past the results of such a test can be thrown way off by the
> hardware involved.  If the hardware/filesystem has had write cache
> enabled then none of these syncs can do their job and transactions are
> at risk no matter what option is picked.

It sounds like a very good idea to collect data on how various JVM/OS 
combinations handle this.
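A test along these lines could be quite small. The sketch below is not the original I/O test program (which I have not seen); it is a minimal stand-in with made-up names (SyncBench, timeWrites) that preallocates a file, then times a number of sequential writes in "rws" and "rwd" mode. The record size and write count are arbitrary, and as noted the numbers mean little if the disk's write cache is enabled.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class SyncBench {

    // Preallocates the file, then times 'writes' sequential writes of
    // 'recordSize' bytes in the given mode. Returns elapsed nanoseconds.
    static long timeWrites(File file, String mode, int writes, int recordSize) throws IOException {
        byte[] record = new byte[recordSize];

        // Preallocate with real zero blocks so the timed writes never extend the file.
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            for (int i = 0; i < writes; i++) {
                raf.write(record);
            }
            raf.getFD().sync();
        }

        long start = System.nanoTime();
        try (RandomAccessFile raf = new RandomAccessFile(file, mode)) {
            for (int i = 0; i < writes; i++) {
                raf.write(record);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        int writes = 200;
        int recordSize = 4096;
        for (String mode : new String[] {"rws", "rwd"}) {
            File file = File.createTempFile("syncbench-" + mode, ".log");
            file.deleteOnExit();
            long nanos = timeWrites(file, mode, writes, recordSize);
            System.out.println(mode + ": " + (nanos / 1000 / writes) + " us per write");
        }
    }
}
```

Running it on a file system backed by the same disk as the database log, with the write cache both enabled and disabled, would give the most comparable numbers.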

> Also it is more common nowadays for higher end hardware to have 
> battery backed cache to
> optimize the sync case, which then provides instantaneous return from
> the sync request but provides safe transaction as it guarantees the
> write on failure (I've seen this as part of the disk and as part of
> the controller).  This particular hardware feature works VERY well for
> the derby log I/O case as the block being synced for the log file
> metadata tends to be the same block over and over again so basically
> the cache space for it is on the order of 8k.

And it is very common for lower end hardware to have the disk's write 
cache enabled to get similar performance. And most users will be very 
happy with this and unaware of the consequences until one day their 
favorite database is unable to recover after a power failure....


[1] http://java.sun.com/javase/6/docs/api/java/io/RandomAccessFile.html

> Olav Sandstaa wrote:
>> For writing the transaction log to disk Derby uses a
>> RandomAccessFile. If it is supported by the JVM, the log files are
>> opened in "rws" mode making the file system take care of syncing
>> writes to disk. "rws" mode will ensure that both the data and the file
>> meta-data is updated for every write to the file. On most operating
>> systems this leads to two write operations to the disk for every write
>> issued by Derby. This is limiting the throughput of update intensive
>> applications.
>> I have run some simple tests where I have changed mode from "rws" to
>> "rwd" for the Derby log file. When running a small number of
>> concurrent client threads the throughput is almost doubled and the
>> response time is almost halved. I am enclosing two graphs that show
>> this when running a given number of concurrent "tpc-b" clients. The
>> graphs show the throughput when running with "rws" and "rwd" mode 
>> when the
>> disk's write cache has been enabled and disabled.
>> This change should also have a positive impact on the Derby startup
>> time (DERBY-1664) and derbyall. With this change the time for running
>> derbyall goes down by about 10-15 minutes (approximately 10%) :-)
>> Is anyone aware of any issues with not updating the file
>> meta-data for every write? Are there any recovery scenarios where this
>> can make recovery fail? Derby seems to preallocate the log file
>> before starting to use the file, so I think this should not influence
>> the ability to find the last data written to the file after a power
>> failure.
>> Any comments?
>> Thanks,
>> Olav
