abdera-user mailing list archives

From Chris Berry <chriswbe...@gmail.com>
Subject Re: Invalid byte 2 of 3-byte UTF-8 sequence.
Date Wed, 05 Sep 2007 22:50:16 GMT
Thanks James,
Comments inline
--  Chris

On Sep 5, 2007, at 2:27 PM, James M Snell wrote:

> This definitely makes sense, although I'd have to say that the fact
> that woodstox does not appear to be assuming UTF-8 when UTF-8 is the
> standard default for XML is quite troubling.

Reviewing my changes, this may not truly be an Abdera bug,
although my code below does work around the problem.

When we call

              FOMParser.parse( InputStream is,...

it subsequently calls Axiom's

             StAXUtils.createXMLStreamReader(in, charset);

(when there is a charset -- which there should be -- at least in my
case).
This should presumably create a Reader with the proper charset,
but it definitely does not. So there is a bug somewhere in Axiom, or
possibly even in Woodstox.
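
For comparison, the standard StAX API has an overload that takes the encoding alongside the raw stream. The sketch below uses only javax.xml.stream from the JDK. Whether Axiom's StAXUtils ends up delegating to this overload is an assumption that has not been checked here, and the names StaxEncodingDemo and firstElementText are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxEncodingDemo {

  // Parse raw bytes with an explicitly named encoding and return the
  // text content of the first element (hypothetical helper).
  static String firstElementText(byte[] xml, String encoding) throws XMLStreamException {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader reader =
        factory.createXMLStreamReader(new ByteArrayInputStream(xml), encoding);
    try {
      while (reader.hasNext()) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT) {
          return reader.getElementText();
        }
      }
      return null; // no element found
    } finally {
      reader.close();
    }
  }

  public static void main(String[] args) throws XMLStreamException {
    // A title containing a-umlaut, encoded as UTF-8 with no XML declaration.
    byte[] xml = "<title>M\u00e4rchen</title>".getBytes(StandardCharsets.UTF_8);
    System.out.println(firstElementText(xml, "UTF-8"));
  }
}
```

If a call like this decodes correctly while the Abdera path does not, that would point at the charset being dropped somewhere between FOMParser and the underlying parser.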

So what is happening is that the Reader (created by StAXUtils and
subsequently Woodstox) uses the platform default encoding (MacRoman in
my case). That is why it works on Linux -- the default encoding there
is UTF-8.

I don't know what Herbert's default encoding is....
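
A minimal sketch of the effect, independent of Abdera: the same UTF-8 bytes decoded once with an explicit UTF-8 charset and once with a single-byte charset. ISO-8859-1 stands in for MacRoman here only because every JVM is required to ship it; the class name DefaultCharsetDemo and the helper decode are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {

  // Read the whole byte array as text using the given charset.
  static String decode(byte[] bytes, Charset charset) throws IOException {
    try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
      StringBuilder sb = new StringBuilder();
      char[] buf = new char[256];
      int n;
      while ((n = reader.read(buf)) != -1) {
        sb.append(buf, 0, n);
      }
      return sb.toString();
    }
  }

  public static void main(String[] args) throws IOException {
    String original = "<title>M\u00e4rchen</title>"; // contains a-umlaut
    byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

    // Explicit UTF-8 round-trips the text.
    System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(original));
    // A single-byte charset turns the two UTF-8 bytes of the umlaut
    // into two unrelated characters, so the text no longer matches.
    System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(original));
    // What this JVM picks when no charset is given.
    System.out.println(Charset.defaultCharset());
  }
}
```

The last line is the value to compare across machines when the same feed parses on one box and fails on another.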

> Would it be possible for you to put together a patch file with these
> changes?

I would gladly produce a patch.
BUT I really think you need to decide how to handle this.
When I call

              FOMParser.parse( Reader rr,...

This bypasses a bit of code.

IMHO, you should simply roll the required
"FOMParser.parse( InputStream is,..." code into
"FOMParser.parse( Reader rr,...",
and not rely on the underlying code to do the right thing.
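
One way to read that suggestion, as a sketch only: decide the charset up front and hand the parser a Reader that already decodes correctly. Nothing below is Abdera API; ExplicitReader, charsetOf and open are hypothetical names, and the fallback to UTF-8 for a missing or unsupported charset is an assumption about the desired behavior.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExplicitReader {

  // Matches charset=value or charset="value" inside a Content-Type header.
  private static final Pattern CHARSET =
      Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

  // Returns the charset named in the header, or UTF-8 when the header is
  // missing, has no charset parameter, or names a charset this JVM lacks.
  static Charset charsetOf(String contentType) {
    if (contentType != null) {
      Matcher m = CHARSET.matcher(contentType);
      if (m.find()) {
        try {
          return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
          // illegal or unsupported charset name: fall through to the default
        }
      }
    }
    return StandardCharsets.UTF_8;
  }

  // Wrap the raw bytes in a Reader whose decoding does not depend on
  // the platform default encoding.
  static Reader open(InputStream in, String contentType) {
    return new InputStreamReader(in, charsetOf(contentType));
  }

  public static void main(String[] args) throws Exception {
    byte[] body = "<title>M\u00e4rchen</title>".getBytes(StandardCharsets.UTF_8);
    try (Reader r = open(new ByteArrayInputStream(body), "application/atom+xml; charset=utf-8")) {
      System.out.println(charsetOf("application/atom+xml; charset=utf-8"));
      System.out.println((char) r.read());
    }
  }
}
```

The resulting Reader could then be passed to the parse( Reader rr,... ) overload, so the decoding no longer depends on what the layers underneath assume.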

> Oh, and for the Content-Type header, the right thing to do is call the
> getCharacterEncoding method on ClientResponse.  You will still need to
> verify that the value specified for the parameter is correct

So this should be something like this.....

   public BaseResponseContext(T base, boolean chunked) {
     this.base = base;
     setStatus(200);
     setStatusText("OK");
     this.chunked = chunked;
     try {
       // setContentType(getContentType().toString());
       setContentType(getContentType().toString()
           + "; charset=" + getCharacterEncoding());
     } catch (Exception e) {}
   }
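
One caveat with plain string concatenation: if the encoding value were ever null, the header would end up reading "charset=null", and a type that already carries a charset would get a second one. Whether getCharacterEncoding() can return null in Abdera has not been verified here. A small guard, sketched as a standalone helper with hypothetical names (ContentTypes, withCharset):

```java
import java.util.Locale;

class ContentTypes {

  // Append "; charset=..." only when an encoding is known and the
  // content type does not already carry a charset parameter.
  static String withCharset(String contentType, String encoding) {
    if (contentType == null || encoding == null || encoding.isEmpty()) {
      return contentType;
    }
    if (contentType.toLowerCase(Locale.ROOT).contains("charset=")) {
      return contentType;
    }
    return contentType + "; charset=" + encoding;
  }

  public static void main(String[] args) {
    System.out.println(withCharset("application/atom+xml", "utf-8"));
    System.out.println(withCharset("application/atom+xml; charset=utf-8", "iso-8859-1"));
    System.out.println(withCharset("application/atom+xml", null));
  }
}
```

In the constructor this would replace the concatenation with something like setContentType(withCharset(getContentType().toString(), getCharacterEncoding())).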

>
> - James
>
> Chris Berry wrote:
>> Greetings,
>>
>> We figured it out. AFAICT, both my issue and Herbert's are the same.
>> I believe this is a bug in Abdera.
>>
>> There are actually two issues:
>>
>> -----------------------
>> First, Abdera uses HttpClient's
>>
>>         method.getResponseBodyAsStream();
>>
>> in order to obtain a raw stream of bytes for Woodstox (which is the
>> correct thing to do for performance).
>>
>> But Woodstox does NOT assume UTF-8.  So it fails when parsing valid
>> UTF-8 characters.
>>
>> The fix is to change the following line in AbstractClientResponse
>>
>>   public <T extends Element>Document<T> getDocument( Parser parser,
>> ParserOptions options)
>>          throws ParseException {
>>     try {
>>       .......
>>       // Document<T> doc = parser.parse(getInputStream(), base, options);
>>       Document<T> doc = parser.parse(getReader(), base, options);
>>       ....
>>
>> And to add the following method to AbstractClientResponse
>>
>>   public java.io.Reader getReader() throws java.io.IOException {
>>     String header = getHeader("Content-Type");
>>
>>     String type = "UTF-8"; // default to UTF-8
>>     if (header != null) { // no Content-Type header: keep the default
>>       java.util.regex.Matcher matcher = java.util.regex.Pattern
>>           .compile(".*charset\\s*\\=\\s*(\\S+).*").matcher(header);
>>
>>       if (matcher.matches()) {
>>         type = matcher.group(1);
>>         System.out.println("@@@@@@@@@@@@@@@@@@@@@@ type = " + type);
>>       }
>>     }
>>
>>     return new java.io.InputStreamReader(getInputStream(), type);
>>   }
>>
>> Although, there is likely a cleaner way to get the "charset" param in
>> Abdera??
>>
>> -----------------------------
>> Second,  Abdera is NOT adding the "charset" parameter (e.g.
>> ";charset=utf-8" ) to the Content-Type HTTP Header of the Response
>>
>> So a fix might be to change the following line in BaseResponseContext:
>>
>>   public BaseResponseContext(T base, boolean chunked) {
>>     this.base = base;
>>     setStatus(200);
>>     setStatusText("OK");
>>     this.chunked = chunked;
>>     try {
>>
>>       //  setContentType(getContentType().toString());
>>       setContentType(getContentType().toString() + "; charset=utf-8");
>>
>>     } catch (Exception e) {}
>>   }
>>
>> Although there are likely better ways/places to accomplish this
>> within Abdera.
>> Perhaps I need to set this in my SpringAbderaServlet??
>>
>>
>> I will add these details to the JIRA as well.
>> Thanks,
>> -- Chris
>> On Sep 5, 2007, at 11:53 AM, James M Snell wrote:
>>
>>> Hmmm... how odd.  Ok, let me explore a bit further.
>>>
>>> - James
>>>
>>> herbert wrote:
>>>> Hi!
>>>>
>>>> I've already tried that before.
>>>> Using the escape sequence \u00e4 also does *not* work.
>>>>
>>>> Herbert
>>
>> S'all good  ---   chriswberry at gmail dot com
>>
>>
>>
>>

S'all good  ---   chriswberry at gmail dot com



