axis-java-user mailing list archives

From Manuel Mall <man...@apache.org>
Subject Re: Two questions - BOM in UTF-8, and manually cleaning XML
Date Thu, 06 Jul 2006 00:10:03 GMT
On Thursday 06 July 2006 04:35, Matthew Brown wrote:
> Davanum,
>
> I had tried this previously, and the only effect I noticed was that
> the encoding attribute of my request message's prolog changed. The
> response message was still being parsed as UTF-8 (which the headers
> had said), although it was actually UTF-16.
>
> Anyway, now that the service provider has changed their service to
> return true UTF-8 data, and Xerces still has trouble interpreting the
> UTF-8 BOM before the prolog, I have found a very hack-ish solution:
> add a handler that removes any characters before the XML prolog in
> the currentMessage if the MessageContext is past the pivot. This
> doesn't feel like a great solution to me (why isn't the XML parser
> prepared to handle the BOM? Is the wrong parse method being used?),
> but it works for us for right now.
>

Good to hear you found a workaround. I must admit the problem intrigued
me a bit, and after some googling I came across a post that said Xerces
can handle UTF-8 with a BOM if given the chance to do so, that is, if
Xerces is given an InputStream to parse, because it wraps the stream in
its own BOM-aware Reader. If Xerces is instead given a Reader object,
it is the supplied Reader that determines the encoding (and there seems
to be a known problem in this area with the default Java Reader +
UTF-8 + BOM).
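
To make the InputStream-versus-Reader distinction concrete, here is a
minimal sketch. It uses the JAXP parser bundled with the JDK rather than
Axis, so it says nothing about which route Axis actually takes, and
BomParseDemo and its method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Axis or Xerces.
class BomParseDemo {

    // A tiny UTF-8 document preceded by the UTF-8 BOM (ef bb bf).
    static byte[] sampleWithBom() {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><a/>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[xml.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(xml, 0, out, 3, xml.length);
        return out;
    }

    // Byte-stream route: the parser receives the raw bytes, BOM included,
    // and does its own encoding detection.
    static String rootViaStream(byte[] bytes) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new InputSource(new ByteArrayInputStream(bytes)))
                    .getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Character-stream route: the bytes are decoded by a plain
    // InputStreamReader before the parser ever sees them.
    // Returns the root element name, or a description of the failure.
    static String rootViaReader(byte[] bytes) {
        try {
            Reader reader = new InputStreamReader(
                    new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new InputSource(reader))
                    .getDocumentElement().getNodeName();
        } catch (Exception e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        byte[] doc = sampleWithBom();
        System.out.println("via InputStream: " + rootViaStream(doc));
        System.out.println("via Reader:      " + rootViaReader(doc));
    }
}
```

With the stream, the parser can detect and skip the BOM itself. With the
Reader, the BOM typically reaches the parser as the character U+FEFF in
front of the prolog, which is the situation the post describes.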

I am not sure whether this is why your message cannot be decoded within
Axis, as I don't know how Axis invokes the SAX parser. Those more
familiar with the internals of the Axis code may be able to assess
this.
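
If the calling code does insist on a Reader, one option is to drop the
UTF-8 BOM from the byte stream before decoding it. A minimal sketch,
assuming the response is available as an InputStream; Utf8BomSkipper and
its methods are hypothetical names, not Axis or Xerces API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Axis or Xerces.
class Utf8BomSkipper {

    // Returns a stream positioned after a leading UTF-8 BOM (ef bb bf),
    // or at the very start if no BOM is present.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop.
        while (n < 3) {
            int r = pb.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            // Not a BOM: push the bytes back so nothing is lost.
            pb.unread(head, 0, n);
        }
        return pb;
    }

    // Convenience for trying it out: decodes a byte array as UTF-8,
    // minus any leading BOM.
    static String decodeUtf8(byte[] raw) {
        try (Reader reader = new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(raw)), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int len;
            while ((len = reader.read(buf)) != -1) {
                sb.append(buf, 0, len);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Where such a wrapper could be hooked into the Axis processing chain is a
separate question that this sketch does not answer.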

Manuel

> Thanks for the help all
> Matt
>
> ---------
>
> package com.viecore.ipl.ws;
>
> import javax.xml.soap.SOAPMessage;
>
> import org.apache.axis.AxisFault;
> import org.apache.axis.Message;
> import org.apache.axis.MessageContext;
> import org.apache.axis.SOAPPart;
> import org.apache.axis.handlers.BasicHandler;
> import org.apache.log4j.LogManager;
> import org.apache.log4j.Logger;
>
> // Drops anything that precedes the XML prolog (such as a BOM) from
> // the response message.
> public class MyHandler extends BasicHandler {
>
>     private static Logger log = LogManager.getLogger(MyHandler.class);
>
>     public void invoke(MessageContext messageContext) throws AxisFault {
>
>         try {
>             if (log.isInfoEnabled()) log.info("invoke - start");
>             log.info("invoke - past pivot [" + messageContext.getPastPivot() + "]");
>
>             SOAPMessage rpcMsg = messageContext.getMessage();
>
>             if (rpcMsg instanceof Message) {
>                 Message axisMsg = (Message) rpcMsg;
>
>                 if (log.isDebugEnabled())
>                     log.debug("invoke - cast javax.xml.soap.SOAPMessage to org.apache.axis.Message");
>
>                 javax.xml.soap.SOAPPart rpcPart = axisMsg.getSOAPPart();
>                 if (rpcPart instanceof SOAPPart) {
>                     SOAPPart axisPart = (SOAPPart) rpcPart;
>
>                     if (log.isDebugEnabled())
>                         log.debug("invoke - cast javax.xml.soap.SOAPPart to org.apache.axis.SOAPPart");
>
>                     Object currentMessage = axisPart.getCurrentMessage();
>                     if (currentMessage == null) {
>                         log.debug("invoke - current message is null, cannot clean");
>                     }
>                     else {
>                         if (log.isDebugEnabled())
>                             log.debug("invoke - current message of SOAP part has type ["
>                                     + currentMessage.getClass().getName() + "] content ["
>                                     + currentMessage.toString() + "]");
>
>                         // attempt to remove bad characters from the response
>                         // (past the pivot means we are on the response side)
>                         if (messageContext.getPastPivot()) {
>
>                             if (currentMessage instanceof String) {
>                                 String strMessage = (String) currentMessage;
>                                 int idx = strMessage.indexOf("<?xml");
>                                 if (idx == -1) {
>                                     log.warn("invoke - Could not find xml prolog in response message");
>                                 }
>                                 else {
>                                     // keep everything from the prolog onwards
>                                     String cleaned = strMessage.substring(idx);
>
>                                     log.debug("invoke - Setting SOAPPart.currentMessage to: " + cleaned);
>
>                                     axisPart.setCurrentMessage(cleaned, axisPart.getCurrentForm());
>                                 }
>                             }
>                         }
>                     }
>                 }
>             }
>             if (log.isInfoEnabled()) log.info("invoke - complete");
>         }
>         catch (Exception ex) {
>             // any exception is logged and not rethrown
>             log.error("Caught exception in invoke()", ex);
>         }
>     }
>
> }
>
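
The handler above discards everything in front of "<?xml", whatever it
is. A narrower variation, sketched below, removes only a leading U+FEFF,
which is how a decoded BOM appears once the message is held as a String.
BomStrings and stripLeadingBom are hypothetical names, and this has not
been tried against Axis.

```java
// Hypothetical helper; not part of Axis.
class BomStrings {

    // Returns s without a leading U+FEFF (the decoded BOM); any other
    // input, including null and the empty string, is returned unchanged.
    static String stripLeadingBom(String s) {
        if (s != null && !s.isEmpty() && s.charAt(0) == '\uFEFF') {
            return s.substring(1);
        }
        return s;
    }
}
```

Unlike the indexOf("<?xml") approach, this also works for a message that
carries no XML declaration at all, and it leaves any other unexpected
leading content in place so that the parser can report it.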
> -----Original Message-----
> From: Davanum Srinivas [mailto:davanum@gmail.com]
> Sent: Wednesday, July 05, 2006 3:41 PM
> To: axis-user@ws.apache.org
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
>
> did you see my response on setting the CHARACTER_SET_ENCODING? what
> is the exact stack trace you get on the client?
>
> thanks,
> dims
>
> On 7/5/06, Matthew Brown <matthew.brown@viecore.com> wrote:
> > text/xml and utf-8, which I suppose explains the attempt to parse
> > the UTF-16 message as UTF-8. The customer has changed the format of
> > the message to correctly be UTF-8 in actuality, although Xerces
> > still isn't a fan of the UTF-8 BOM (ef bb bf).
> >
> >
> >
> > -----Original Message-----
> > From: Simon Fell [mailto:sfell@salesforce.com]
> > Sent: Wednesday, July 05, 2006 2:46 PM
> > To: axis-user@ws.apache.org
> > Subject: RE: Two questions - BOM in UTF-8, and manually cleaning
> > XML
> >
> >
> > What does the content-type header say the charset is? That takes
> > precedence over the payload (at least for SOAP 1.1)
> >
> > Cheers
> > Simon
> >
> > -----Original Message-----
> > From: Rodrigo Ruiz [mailto:rruiz@gridsystems.com]
> > Sent: Wednesday, July 05, 2006 8:30 AM
> > To: axis-user@ws.apache.org
> > Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
> > XML
> >
> > Maybe changing the xml prolog from "utf-8" to "utf-16" will be
> > easier. It seems like a demo example for a servlet filter ;-)
> >
> >
> > Hope this helps,
> > Rodrigo
> >
> > Manuel Mall wrote:
> > > On Wednesday 05 July 2006 23:12, Matthew Brown wrote:
> > >> Two bytes per char; Etherpeak is showing the second byte as 00.
> > >
> > > Seems you are stuck between a "rock and a hard place" here. The
> > > byte stream appears to be correctly utf-16 encoded but the xml
> > > prolog says utf-8. Not sure what to recommend. Fix it at the
> > > source is obvious but not easily done. You may be able to write a
> > > handler that re-encodes the byte stream into utf-8 before giving
> > > it to the Axis stacks. But how to write such an Axis handler and
> > > how to hook it correctly into the Axis processing chain is
> > > outside my area of expertise.
> > >
> > > > Maybe someone else can give advice on how to attempt such a
> > > > thing.
> > >
> > > Manuel
> > >
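
The re-encoding step suggested above can be sketched on its own, leaving
aside the question of where to hook it into Axis. This is a minimal
sketch that assumes the whole response is available as a byte array;
Utf16ToUtf8 and toUtf8 are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Axis.
class Utf16ToUtf8 {

    // If raw starts with a UTF-16 BOM (ff fe = little endian,
    // fe ff = big endian), decodes the rest as UTF-16 and returns it
    // re-encoded as UTF-8 without a BOM. Otherwise returns raw unchanged.
    // UTF-32 input, whose little-endian BOM also begins ff fe, is not handled.
    static byte[] toUtf8(byte[] raw) {
        if (raw.length >= 2) {
            int b0 = raw[0] & 0xFF;
            int b1 = raw[1] & 0xFF;
            Charset charset = null;
            if (b0 == 0xFF && b1 == 0xFE) {
                charset = StandardCharsets.UTF_16LE;
            } else if (b0 == 0xFE && b1 == 0xFF) {
                charset = StandardCharsets.UTF_16BE;
            }
            if (charset != null) {
                // Skip the two BOM bytes, then decode and re-encode.
                String text = new String(raw, 2, raw.length - 2, charset);
                return text.getBytes(StandardCharsets.UTF_8);
            }
        }
        return raw;
    }
}
```

After this conversion the bytes agree with a prolog that declares
encoding="utf-8", so the prolog itself would not need to be rewritten.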
> > >> -----Original Message-----
> > >> From: Manuel Mall [mailto:manuel@apache.org]
> > >> Sent: Wednesday, July 05, 2006 11:09 AM
> > >> To: axis-user@ws.apache.org
> > >> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
> > >> XML
> > >>
> > >> On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
> > >>> Manuel,
> > >>>
> > > >>> I believe you hit the nail on the head - the response prolog
> > > >>> says utf-8 but (according to Etherpeak) the BOM is ff fe.
> > > >>> Incidentally, by the time the response XML gets logged by
> > > >>> Axis, these initial bytes are logged as ef bf bd ef bf bd.
> > >>
> > >> Matt,
> > >>
> > >> what about the rest of the byte stream when you look at it in
> > >> Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8
> > >> encoded (1 byte per char for all typical ascii characters)?
> > >>
> > >> Manuel
> > >>
> > >>> Unfortunately we may be in a bit of a tough place with having
> > >>> the producer of the XML change it; the customer whose web
> > >>> services we are consuming doesn't seem to see any issue with
> > >>> this (as they are fine with their .NET tools).
> > >>>
> > >>> If it is the case where we are seeing a UTF-16 BOM but a prolog
> > >>> that declares UTF-8; is there any way to instruct Axis/Xerces
> > >>> to parse it as UTF-16? Sorry if this question doesn't make much
> > >>> sense, but I'm not too familiar with how Axis and/or Xerces
> > >>> decide which character encoding to use when reading the XML.
> > >>>
> > >>> Thanks again
> > >>> Matt
> > >>>
> > >>> -----Original Message-----
> > >>> From: Manuel Mall [mailto:manuel@apache.org]
> > >>> Sent: Wednesday, July 05, 2006 10:58 AM
> > >>> To: axis-user@ws.apache.org
> > >>> Subject: Re: Two questions - BOM in UTF-8, and manually
> > >>> cleaning XML
> > >>>
> > >>> On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> > >>>> Yes, there is a work-around. It works if you encode the file
> > >>>> with UTF-8 (for example), and do not include the BOM at the
> > >>>> beginning. I use notepad++ for that task, where you can save
> > >>>> in "UTF-8 without BOM".
> > >>>>
> > >>>> The process for that is easy:
> > >>>> 1. open the file in notepad++
> > >>>> 2. mark everything via CTRL-A
> > >>>> 3. cut (not copy!)
> > > >>>> 4. in the format menu, choose "ANSI" formatting and select
> > > >>>>    "UTF without BOM" at the bottom
> > > >>>> 5. paste
> > > >>>> 6. save.
> > >>>>
> > >>>> that is a crap workaround, but works for me. for automatically
> > >>>> generated files ..... I dunno :-)
> > >>>>
> > >>>>
> > >>>> Greetings,
> > >>>> Axel.
> > >>>>
> > >>>>
> > > >>>> On 7/5/06, Matthew Brown <matthew.brown@viecore.com> wrote:
> > >>>>
> > >>>> Hi all,
> > >>>>
> > >>>> I hate to do this, but can anyone please help me with either
> > >>>> of these issues? I've tried to upgrade Xerces to 2.8.0 but to
> > >>>> no avail.
> > >>>>
> > >>>> Is there anything else I could be doing?
> > >>>
> > > >>> Just wondering if your file in question starts with hex 'ef bb
> > > >>> bf' or 'ff fe' or 'fe ff'. If it is one of the latter two forms
> > > >>> I believe you have a UTF-16 encoded file (little endian or big
> > > >>> endian, respectively), not UTF-8. If it is the 'ef bb bf'
> > > >>> sequence then it starts correctly with the UTF-8 encoding of
> > > >>> the BOM code point U+FEFF. In all cases Xerces should be able
> > > >>> to handle it. A problem may arise if it starts with 'ff fe' but
> > > >>> the XML prolog says encoding="utf-8", as that is a
> > > >>> contradiction, I believe.
> > >>>
> > >>> I know this does not help directly but may help to check if the
> > >>> problem is with the producer of the XML document or your
> > >>> consumer.
> > >>>
> > >>> Manuel
> > >>>
> > >>>> What about the possibility of programmatically
> > >>>> editing/cleaning the response XML before it is given to the
> > >>>> parser?
> > >>>>
> > >>>> Thanks
> > >>>> Matt
> > >>>>
> > >>>> -----Original Message-----
> > > >>>> From: Matthew Brown [mailto:matthew.brown@viecore.com]
> > > >>>> Sent: Saturday, July 01, 2006 12:41 PM
> > > >>>> To: axis-user@ws.apache.org
> > >>>> Subject: Two questions - BOM in UTF-8, and manually cleaning
> > >>>> XML
> > >>>>
> > >>>>
> > >>>> 1. From searching the mailing list archives, I see several
> > >>>> references to people having problems with Byte Order Mark
> > >>>> characters appearing before the prolog in their UTF-8
> > >>>> messages. However I can't seem to find much of a known
> > >>>> resolution to these issues. Is there a standard/common
> > >>>> workaround for these BOM and UTF-8 issues?
> > >>>>
> > > >>>> 2. If there is no answer to my #1, is there any way that Axis
> > > >>>> will allow me to programmatically edit the response XML before
> > > >>>> it is passed to the parser and deserialized? I've tried adding
> > > >>>> Handlers, but I'm assuming that the Handler comes into the
> > > >>>> picture after the message is parsed, because my Handler is
> > > >>>> only ever seeing the request message, and not the response.
> > >>>>
> > >>>> Thanks
> > >>>> Matt Brown
> > >>>
> > >>> ---------------------------------------------------------------
> > >>>---- -- To unsubscribe, e-mail:
> > >>> axis-user-unsubscribe@ws.apache.org For additional commands,
> > >>> e-mail: axis-user-help@ws.apache.org
> > >>
> > >> ----------------------------------------------------------------
> > >>----- To unsubscribe, e-mail: axis-user-unsubscribe@ws.apache.org
> > >> For additional commands, e-mail: axis-user-help@ws.apache.org
> > >
> > > -----------------------------------------------------------------
> > >---- To unsubscribe, e-mail: axis-user-unsubscribe@ws.apache.org
> > > For additional commands, e-mail: axis-user-help@ws.apache.org
> >
> > --
> > -------------------------------------------------------------------
> > GRIDSYSTEMS                    Rodrigo Ruiz Aguayo
> > Parc Bit - Son Espanyol
> > 07120 Palma de Mallorca        mailto:rruiz@gridsystems.com
> > Baleares - España              Tel:+34-971435085 Fax:+34-971435082
> > http://www.gridsystems.com
> > -------------------------------------------------------------------
> >
> >


