axis-java-user mailing list archives

From Manuel Mall <man...@apache.org>
Subject Re: Two questions - BOM in UTF-8, and manually cleaning XML
Date Wed, 05 Jul 2006 15:23:58 GMT
On Wednesday 05 July 2006 23:12, Matthew Brown wrote:
> Two bytes per char; Etherpeak is showing the second byte as 00.
>
It seems you are stuck between a rock and a hard place here. The byte
stream appears to be correctly UTF-16 encoded, but the XML prolog says
UTF-8. I am not sure what to recommend. Fixing it at the source is the
obvious answer but not easily done. You may be able to write a handler
that re-encodes the byte stream into UTF-8 before giving it to the Axis
stack. But how to write such an Axis handler and how to hook it
correctly into the Axis processing chain is outside my area of
expertise.

Maybe someone else can give advice on how to attempt such a thing.

Manuel
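
For reference, the re-encoding step suggested above can be sketched in
plain Java, independent of Axis. This is a minimal sketch under stated
assumptions, not a tested Axis integration: the class name BomTranscoder
and both method names are hypothetical, the whole response is assumed to
be available as a byte array, and StandardCharsets needs Java 7 or later.
It only inspects the BOM (EF BB BF for UTF-8, FF FE for UTF-16 little
endian, FE FF for UTF-16 big endian). How to hook it into the Axis
processing chain is not shown.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Axis or Xerces.
final class BomTranscoder {

    /** Returns the charset implied by a leading BOM, or null if there is none. */
    static Charset charsetFromBom(byte[] b) {
        if (b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    /**
     * Re-encodes the payload as UTF-8 without a BOM, trusting the BOM
     * rather than the encoding declared in the XML prolog.
     * Input without a BOM is returned unchanged.
     */
    static byte[] toUtf8(byte[] raw) {
        Charset cs = charsetFromBom(raw);
        if (cs == null) {
            return raw;
        }
        // These decoders keep the BOM as a leading U+FEFF character.
        String text = new String(raw, cs);
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

After transcoding, a UTF-16 payload whose prolog declares
encoding="utf-8" agrees with its own declaration, so the parser no longer
sees a contradiction. UTF-32 is not handled: FF FE 00 00 would be taken
for UTF-16 little endian.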
> -----Original Message-----
> From: Manuel Mall [mailto:manuel@apache.org]
> Sent: Wednesday, July 05, 2006 11:09 AM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
> > Manuel,
> >
> > I believe you hit the nail on the head - the response prolog
> > says utf-8 but (according to Etherpeak) the BOM is ff/ef.
> > Coincidentally, by the time the response XML gets logged by Axis,
> > these initial characters are logged as ef bf bd ef bf bd.
>
> Matt,
>
> What about the rest of the byte stream when you look at it in
> Etherpeak? Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded
> (1 byte per char for all typical ASCII characters)?
>
> Manuel
>
> > Unfortunately we may be in a bit of a tough place with having the
> > producer of the XML change it; the customer whose web services we
> > are consuming doesn't seem to see any issue with this (as they are
> > fine with their .NET tools).
> >
> > If it is the case that we are seeing a UTF-16 BOM but a prolog
> > that declares UTF-8, is there any way to instruct Axis/Xerces to
> > parse it as UTF-16? Sorry if this question doesn't make much sense,
> > but I'm not too familiar with how Axis and/or Xerces decide which
> > character encoding to use when reading the XML.
> >
> > Thanks again
> > Matt
> >
> > -----Original Message-----
> > From: Manuel Mall [mailto:manuel@apache.org]
> > Sent: Wednesday, July 05, 2006 10:58 AM
> > To: axis-user@ws.apache.org
> > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
> > XML
> >
> > On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> > > Yes, there is a work-around. It works if you encode the file with
> > > UTF-8 (for example), and do not include the BOM at the beginning.
> > > I use notepad++ for that task, where you can save in "UTF-8
> > > without BOM".
> > >
> > > The process for that is easy:
> > > 1. open the file in notepad++
> > > 2. mark everything via CTRL-A
> > > 3. cut (not copy!)
> > > 4. in the format menu, choose "ANSI" formatting and select "UTF-8
> > > without BOM" at the bottom
> > > 5. paste
> > > 6. save.
> > >
> > > that is a crap workaround, but works for me. for automatically
> > > generated files ..... I dunno :-)
> > >
> > >
> > > Greetings,
> > > Axel.
> > >
> > >
> > > On 7/5/06, Matthew Brown <matthew.brown@viecore.com> wrote:
> > >
> > > Hi all,
> > >
> > > I hate to do this, but can anyone please help me with either of
> > > these issues? I've tried to upgrade Xerces to 2.8.0 but to no
> > > avail.
> > >
> > > Is there anything else I could be doing?
> >
> > Just wondering if your file in question starts with hex 'ef bb bf'
> > or 'ff fe' or 'fe ff'. If it is one of the latter two forms I
> > believe you have a UTF-16 encoded file (little endian or big
> > endian), not UTF-8. If it is the 'ef bb bf' sequence then it starts
> > correctly with the UTF-8 encoded Unicode code point for BOM, U+FEFF.
> > In all cases Xerces should be able to handle it. A problem may
> > arise if it starts with 'ff fe' but the XML prolog says
> > encoding="utf-8", as that is a contradiction I believe.
> >
> > I know this does not help directly but may help to check if the
> > problem is with the producer of the XML document or your consumer.
> >
> > Manuel
> >
> > > What about the possibility of programmatically editing/cleaning
> > > the response XML before it is given to the parser?
> > >
> > > Thanks
> > > Matt
> > >
> > > -----Original Message-----
> > > From: Matthew Brown [mailto:matthew.brown@viecore.com]
> > > Sent: Saturday, July 01, 2006 12:41 PM
> > > To: axis-user@ws.apache.org
> > > Subject: Two questions - BOM in UTF-8, and manually cleaning XML
> > >
> > >
> > > 1. From searching the mailing list archives, I see several
> > > references to people having problems with Byte Order Mark
> > > characters appearing before the prolog in their UTF-8 messages.
> > > However I can't seem to find much of a known resolution to these
> > > issues. Is there a standard/common workaround for these BOM and
> > > UTF-8 issues?
> > >
> > > 2. If there is no answer to my #1, is there any way that Axis will
> > > allow me to programmatically edit the response XML before it is
> > > passed to the parser and de-serialized? I've tried adding
> > > Handlers, but I'm assuming that the Handler comes into the
> > > picture after the message is parsed, because my Handler is only
> > > ever seeing the request message, and not the response.
> > >
> > > Thanks
> > > Matt Brown
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: axis-user-unsubscribe@ws.apache.org
> > For additional commands, e-mail: axis-user-help@ws.apache.org