db-derby-dev mailing list archives

From Army <qoz...@gmail.com>
Subject Writing platform-specific line-endings to disk...
Date Fri, 17 Nov 2006 23:04:21 GMT
As part of my work for DERBY-1758 I'm looking at the XML binding test 
(lang/xmlBinding.java in the old harness, lang/XMLBindingTest.java in JUnit) and 
I noticed that the test, which counts characters as a simple sanity check for 
insertion of docs larger than 32k, returns different results on Linux vs 
Windows.  (Actually, Bryan Pendleton was the first one to notice this a while 
back when he was reviewing DERBY-688 changes).

Long story short, Xalan serialization (which is what Derby uses to serialize XML 
documents) inserts platform-specific line-endings (based on the "line.separator" 
System property) into XML documents for every newline.  This appears to be 
technically valid, so it is not a bug per se [1].  However, from a Derby 
perspective this means that someone who inserts the exact same XML document into 
an XML column on Windows vs on Linux will actually be inserting more characters 
in the former case than in the latter (because the Windows line separator is two 
characters).  Or put differently, when inserting an XML document on Windows an 
extra character is written to disk for every line in the XML document.  This 
does *not* happen with other character types (e.g. CLOB).
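
To make the size difference concrete, here is a minimal sketch that takes one small document and writes its newlines once as "\n" and once as "\r\n". The class name, method name and sample document are made up for illustration; this does not call Xalan, it only mimics the effect of the "line.separator" substitution described above.

```java
// Illustration only: names and the sample document are assumptions, not Derby or Xalan code.
final class LineEndingCount {

    /** Returns the document with every "\n" written as the given separator. */
    static String withSeparator(String doc, String separator) {
        return doc.replace("\n", separator);
    }

    public static void main(String[] args) {
        // Four lines, each terminated by a newline.
        String doc = "<root>\n  <a/>\n  <b/>\n</root>\n";

        String unixStyle = withSeparator(doc, "\n");
        String windowsStyle = withSeparator(doc, "\r\n");

        System.out.println("LF length:   " + unixStyle.length());
        System.out.println("CRLF length: " + windowsStyle.length());
        // The difference equals the number of line endings in the document.
        System.out.println("difference:  " + (windowsStyle.length() - unixStyle.length()));
    }
}
```

The difference between the two lengths is one character per line ending, which is the discrepancy the test sees between Linux and Windows.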

My question, then, is this: Is it considered a "bug" in Derby if insertion of 
the same XML value by the user can lead to different data (namely, line ending 
characters) being written to disk for different platforms?

There appear to be two obvious ways to get around this problem: 1) add logic in 
Derby engine to take the result of Xalan serialization and replace 
platform-specific line-endings with "\n", or 2) change the XML binding test to 
always count line-endings as a single "character" for the sake of asserting 
character counts.
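
A rough sketch of what option 1 might look like, assuming the serialized document is available as a String and that the platform separator shows up in it only as a line ending. The helper class and method are hypothetical; nothing like this exists in Derby today, and where it would actually hook into the engine is left open.

```java
// Hypothetical helper, not part of Derby: the name and signature are assumptions.
final class LineEndingNormalizer {

    /**
     * Replaces every occurrence of the platform line separator with "\n".
     * Assumes the separator appears in the serialized text only as a line ending.
     */
    static String normalize(String serialized, String platformSeparator) {
        if (platformSeparator.isEmpty() || "\n".equals(platformSeparator)) {
            return serialized; // nothing to replace
        }
        return serialized.replace(platformSeparator, "\n");
    }

    public static void main(String[] args) {
        String separator = System.getProperty("line.separator");
        // Stand-in for the serializer's output on the current platform.
        String serialized = "<root>" + separator + "  <a/>" + separator + "</root>";

        String normalized = normalize(serialized, separator);
        // With normalization the stored length no longer depends on the platform.
        System.out.println(normalized.length());
    }
}
```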

I'm leaning toward option 1, but am not particularly driven one way or the 
other.  If the answer to my above question is "Yes, it's a bug", then option 1 
is clearly the only option; otherwise option 2 makes the test pass and is easy 
to implement.  It does feel a tad like cheating, though...
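
For comparison, option 2 could be sketched as a counting routine that treats a "\r\n" pair as a single character. The class and method names are invented for this example and are not taken from the existing test; the real test would presumably apply the same idea to whatever it reads back from the XML column.

```java
// Hypothetical test helper: the name and signature are assumptions.
final class LineEndingAwareCount {

    /** Counts characters, treating each "\r\n" pair as one line-ending character. */
    static int countChars(CharSequence text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            boolean crLfPair = text.charAt(i) == '\r'
                    && i + 1 < text.length()
                    && text.charAt(i + 1) == '\n';
            if (crLfPair) {
                i++; // skip the '\n' half so the pair is counted once
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Both forms of the same two-line text give the same count.
        System.out.println(countChars("<a/>\n<b/>\n"));
        System.out.println(countChars("<a/>\r\n<b/>\r\n"));
    }
}
```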

Comments/feedback are appreciated, if anyone has any.




[1] I searched Jira for this and found a couple of relevant Xalan issues, especially 
XALANJ-2093 and XALANJ-1701.  There is apparently a new property introduced in 
Xalan 2.7 to allow the user to indicate what should happen with newlines, but 
that property is non-standard and would require Derby to use Xalan 2.7 in order 
to build.

Based on comments in the aforementioned XALANJ issues it looks like it is 
technically valid for Xalan to convert the newlines to platform-specific 
endings.  This seems to agree with the following quote from the W3C page on 


"When outputting a newline character in the instance of the data model, the 
serializer is free to represent it using any character sequence that will be 
normalized to a newline character by an XML parser, unless a specific mapping 
for the newline character is provided in a character map (see 9 Character Maps)."

I don't know what Xalan serialization does with character maps, but there is 
nothing explicit in Derby to specify use of such maps, so my (admittedly 
lacking) understanding is that it's okay for Xalan to return platform-specific 
line-endings when serializing.
