ibatis-user-java mailing list archives

From Joachim Hoffmann <j.hoffm...@mine-it.at>
Subject xml encoding <> BOM <> xml parse error: no content in prolog
Date Wed, 14 Sep 2005 11:05:31 GMT
I was banging my head against this XML problem last night.
To get the XML files for iBatis parsed on a German Windows 2000, I had to
set encoding="ISO-8859-1" and convert the file to ASCII-DOS in 

Saving a file as UTF-8 puts the famous but invisible 3 bytes EF BB BF 
at the beginning,
which the Java resource reader interprets as data, and that causes 
the XML
parse error. (Visibility increases to VMC in a hex editor or with 
System.out.print((char)reader.read()).)
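
To make the effect visible, here is a minimal sketch that decodes a BOM-prefixed byte sequence with Java's UTF-8 reader and prints the first character it returns. The class name BomProbe and the sample bytes are made up for illustration, and StandardCharsets needs Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of iBatis or the JDK.
class BomProbe {
    /** Decodes the bytes as UTF-8 and returns the first character read, or -1 if empty. */
    static int firstChar(byte[] data) throws IOException {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            return r.read();
        }
    }

    public static void main(String[] args) throws IOException {
        // EF BB BF followed by the start of an XML declaration
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'};
        // The decoder hands the BOM back as an ordinary character.
        System.out.println(String.format("U+%04X", firstChar(withBom))); // prints U+FEFF
    }
}
```

An XML parser fed from such a reader sees U+FEFF before the "<?xml" declaration, which matches the "no content in prolog" symptom.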

It's also an UltraEdit issue: it likes to keep the UTF-16 BOM of FF 
FE for UTF-8 files (U8-DOS),
unless you convert to UTF-8(ASCII-Editing)  ... i.e. not  UTF-8(Unicode 

Very helpful is the Wikipedia article on the Byte Order Mark.

The cause of all the trouble seems to be that Windows-based editors 
prepend a
BOM to UTF-8 files, which the Java reader seems to 
interpret directly as data.

IMHO this is a show stopper for iBatis, even if it is caused by others!
I would recommend describing it somewhere in the Tutorial and FAQ.

Hope this helps others,

It is outrageous to find the following Java bug, open since 2001.
Workaround readers are listed there!
*Bug ID:* 	4508058
*Votes* 	74
*Synopsis* 	UTF-8 encoding does not recognize initial BOM
*Category* 	java:char_encodings
*Reported Against* 	1.4.2_05 , merlin-beta
*Release Fixed* 	
*State* 	In progress, bug
*Related Bugs* 	
*Submit Date* 	27-SEP-*2001*

Java does not recognize the optional BOM which can begin a UTF-8 stream.  
It treats the BOM as if it were the initial character of the stream.
A UTF-8 stream can optionally begin with a byte order mark 
(see, for example http://www.unicode.org.unicode/faq/utf_bom.html).
This is the character FEFF, which is represented as EF BB BF in UTF-8. 
Java's UTF-8 encoding does not recognize this character as a BOM, though; 
the result of reading such a stream is a set of characters beginning with FEFF.

*Work Around* 	

Application code must recognize and skip the BOM itself.
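
One way application code could do this is sketched below, under a few assumptions: the class and method names (BomSkipper, openUtf8SkippingBom) are made up for illustration, the input is known to be UTF-8, and StandardCharsets (Java 7 or later) is available. It is not one of the workaround readers listed in the bug report, just a minimal variant of the same idea using PushbackInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of iBatis or the JDK.
class BomSkipper {
    /** Returns a UTF-8 reader over the stream, dropping a leading EF BB BF if present. */
    static Reader openUtf8SkippingBom(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may deliver fewer bytes than asked for, so loop until 3 bytes or EOF.
        while (n < 3) {
            int r = in.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            in.unread(head, 0, n); // not a BOM: give the bytes back to the stream
        }
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        Reader r = openUtf8SkippingBom(new ByteArrayInputStream(xml));
        System.out.println((char) r.read()); // prints <
    }
}
```

The returned reader can then be handed to whatever parses the XML in place of a plain resource reader. Files without a BOM pass through unchanged.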
