abdera-user mailing list archives

From Chris Berry <chriswbe...@gmail.com>
Subject Re: Invalid byte 2 of 3-byte UTF-8 sequence.
Date Wed, 05 Sep 2007 19:14:13 GMT
Greetings,

We figured it out. AFAICT, my issue and Herbert's are the same.
I believe this is a bug in Abdera.

There are actually two issues:

-----------------------
First, Abdera uses HttpClient's

         method.getResponseBodyAsStream();

to obtain a raw byte stream for Woodstox (which is the correct thing
to do for performance).

But Woodstox does NOT assume UTF-8, so it fails when parsing valid
UTF-8 characters.

The fix is to change the following line in AbstractClientResponse:

   public <T extends Element>Document<T> getDocument(Parser parser, ParserOptions options)
          throws ParseException {
     try {
       .......
       // Document<T> doc = parser.parse(getInputStream(), base, options);
       Document<T> doc = parser.parse(getReader(), base, options);
       ....

And to add the following method to AbstractClientResponse:

   public java.io.Reader getReader() throws java.io.IOException {
     String header = getHeader("Content-Type");

     String type = "UTF-8"; // default to UTF-8
     if (header != null) { // the header may be missing entirely
       // Stop at ';', '"' or whitespace so that quotes and any
       // trailing parameters are not captured as part of the name.
       java.util.regex.Matcher matcher = java.util.regex.Pattern.compile(
           "(?i)charset\\s*=\\s*\"?([^\\s;\"]+)").matcher(header);
       if (matcher.find()) {
         type = matcher.group(1);
       }
     }

     return new java.io.InputStreamReader(getInputStream(), type);
   }

Although there is likely a cleaner way to get the "charset" param in
Abdera?
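
For anyone who wants to try the charset lookup outside of Abdera, here is a minimal standalone sketch. The class and method names (ContentTypeCharset, of) are made up for illustration and are not part of Abdera or HttpClient; it only assumes a reasonably modern JDK (java.nio.charset.StandardCharsets). It falls back to UTF-8 when the header is missing, has no charset parameter, or names a charset the JVM does not know.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not an Abdera class): extracts the charset
// parameter from a Content-Type header value.
final class ContentTypeCharset {
    // Matches ;charset=value or ;charset="value" (case-insensitive),
    // stopping at ';', '"' or whitespace.
    private static final Pattern CHARSET = Pattern.compile(
        "(?i);\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?");

    static Charset of(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8; // no header at all
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return StandardCharsets.UTF_8; // no charset parameter
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: fall back rather than fail.
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        System.out.println(of("application/atom+xml; charset=ISO-8859-1"));
        System.out.println(of("application/atom+xml;type=entry;charset=\"utf-8\""));
        System.out.println(of(null));
    }
}
```

The resulting Charset can be passed straight to new java.io.InputStreamReader(stream, charset), which avoids the UnsupportedEncodingException path of the String-based constructor.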

-----------------------------
Second, Abdera is NOT adding the "charset" parameter (e.g.
"; charset=utf-8") to the Content-Type HTTP header of the response.

So a fix might be to change the following line in BaseResponseContext:

   public BaseResponseContext(T base, boolean chunked) {
     this.base = base;
     setStatus(200);
     setStatusText("OK");
     this.chunked = chunked;
     try {

       //  setContentType(getContentType().toString());
       setContentType(getContentType().toString() + "; charset=utf-8");

     } catch (Exception e) {}
   }

Although there are likely better ways/places to accomplish this
within Abdera.
Perhaps I need to set this in my SpringAbderaServlet?
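
One thing to watch with blindly appending "; charset=utf-8" is a content type that already carries a charset. A rough sketch of a guard follows; the names (ContentTypes, withCharset) are hypothetical and not Abdera API, and the check is a simple pattern match rather than a full MIME parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not an Abdera class): appends a charset
// parameter to a content type only when none is present.
final class ContentTypes {
    // Detects an existing ;charset= parameter, case-insensitively.
    private static final Pattern HAS_CHARSET =
        Pattern.compile("(?i);\\s*charset\\s*=");

    static String withCharset(String contentType, String charset) {
        if (contentType == null || contentType.isEmpty()) {
            return contentType; // nothing to decorate
        }
        if (HAS_CHARSET.matcher(contentType).find()) {
            return contentType; // keep the charset that is already declared
        }
        return contentType + "; charset=" + charset;
    }

    public static void main(String[] args) {
        System.out.println(withCharset("application/atom+xml", "utf-8"));
        System.out.println(withCharset("application/atom+xml; charset=ISO-8859-1", "utf-8"));
    }
}
```

With something like this, the constructor change would become setContentType(withCharset(getContentType().toString(), "utf-8")), assuming the bytes written to the response really are UTF-8.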


I will add these details to the JIRA as well.
Thanks,
-- Chris 

On Sep 5, 2007, at 11:53 AM, James M Snell wrote:

> Hmmm... how odd.  Ok, let me explore a bit further.
>
> - James
>
> herbert wrote:
>> Hi!
>>
>> I've already tried that before.
>> Using the escape sequence \u00e4 also does *not* work.
>>
>> Herbert

S'all good  ---   chriswberry at gmail dot com



