abdera-user mailing list archives

From James M Snell <jasn...@gmail.com>
Subject Re: Invalid byte 2 of 3-byte UTF-8 sequence.
Date Wed, 05 Sep 2007 19:34:04 GMT
Oh, and if you are able to put together a patch file, please post it to
jira. :-)

- James

Chris Berry wrote:
> Greetings,
> 
> We figured it out. AFAICT, both my issue and Herbert's are the same.
> I believe this is a bug in Abdera.
> 
> There are actually two issues:
> 
> -----------------------
> First, Abdera uses HttpClient's
> 
>         method.getResponseBodyAsStream();
> 
> in order to obtain a raw stream of bytes for Woodstox (which is the
> correct thing to do for performance).
> 
> But Woodstox does NOT assume UTF-8.  So it fails when parsing valid
> UTF-8 characters.
> 
> The fix is to change the following line in AbstractClientResponse
> 
>   public <T extends Element> Document<T> getDocument(Parser parser,
>       ParserOptions options) throws ParseException {
>     try {
>       .......
>       // Document<T> doc = parser.parse( getInputStream(), base, options);
>       Document<T> doc = parser.parse(getReader(), base, options);
>       ....
> 
> And to add the following method to AbstractClientResponse
> 
>   public java.io.Reader getReader() throws java.io.IOException {
>     String header = getHeader("Content-Type");
> 
>     String type = "UTF-8"; // default to UTF-8
>     if (header != null) { // the Content-Type header may be absent
>       // stop at whitespace, ';' or '"' so trailing params are not captured
>       java.util.regex.Matcher matcher = java.util.regex.Pattern
>           .compile(".*charset\\s*=\\s*\"?([^\\s;\"]+).*").matcher(header);
>       if (matcher.matches()) {
>         type = matcher.group(1);
>       }
>     }
> 
>     return new java.io.InputStreamReader(getInputStream(), type);
>   }
> 
> Although, there is likely a cleaner way to get the "charset" param in
> Abdera??
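
For anyone trying this outside Abdera, here is a minimal standalone sketch of the same charset lookup. It assumes Java 7+ (for StandardCharsets), and CharsetSniffer, charsetOf and readerFor are hypothetical names for illustration, not Abdera API. It falls back to UTF-8 when the header is missing or names a charset the JVM does not know.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Abdera.
class CharsetSniffer {
    // Matches charset=utf-8 and charset="utf-8"; stops at whitespace, ';' or '"'.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type value, or UTF-8 as the default.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: fall through to the default.
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Wraps the raw response bytes in a Reader that decodes with the chosen charset.
    static Reader readerFor(InputStream body, String contentType) {
        return new InputStreamReader(body, charsetOf(contentType));
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "\u00e4".getBytes(StandardCharsets.UTF_8); // two bytes: C3 A4
        Reader r = readerFor(new ByteArrayInputStream(bytes),
                "application/atom+xml; charset=utf-8");
        System.out.println(r.read()); // prints 228 (U+00E4)
    }
}
```

The pattern is deliberately loose (it would also match a parameter whose name merely ends in "charset"), so treat it as a sketch, not a full media-type parser.
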
> 
> -----------------------------
> Second, Abdera is NOT adding the "charset" parameter (e.g.
> ";charset=utf-8") to the Content-Type HTTP header of the response.
> 
> So a fix might be to change the following line in BaseResponseContext:
> 
>   public BaseResponseContext(T base, boolean chunked) {
>     this.base = base;
>     setStatus(200);
>     setStatusText("OK");
>     this.chunked = chunked;
>     try {
> 
>       //  setContentType(getContentType().toString());
>       setContentType(getContentType().toString() + "; charset=utf-8");
> 
>     } catch (Exception e) {}
>   }
> 
> Although there are likely better ways/places to accomplish this within
> Abdera.
> Perhaps I need to set this in my SpringAbderaServlet??
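
One caveat with appending "; charset=utf-8" unconditionally is that a content type which already carries a charset would end up with two. A minimal sketch of a guard, assuming a plain string header value; ContentTypes and withCharset are hypothetical names, not Abdera API:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Abdera.
class ContentTypes {
    // Detects an existing charset parameter such as "; charset=utf-8".
    private static final Pattern HAS_CHARSET = Pattern.compile(
            ";\\s*charset\\s*=", Pattern.CASE_INSENSITIVE);

    // Appends "; charset=<charset>" unless the value is empty or already has one.
    static String withCharset(String contentType, String charset) {
        if (contentType == null || contentType.isEmpty()) {
            return contentType;
        }
        if (HAS_CHARSET.matcher(contentType).find()) {
            return contentType;
        }
        return contentType + "; charset=" + charset;
    }

    public static void main(String[] args) {
        // prints application/atom+xml; charset=utf-8
        System.out.println(withCharset("application/atom+xml", "utf-8"));
        // prints application/atom+xml; charset=ISO-8859-1
        System.out.println(withCharset("application/atom+xml; charset=ISO-8859-1", "utf-8"));
    }
}
```
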
> 
> 
> I will add these details to the JIRA as well.
> Thanks,
> -- Chris
> On Sep 5, 2007, at 11:53 AM, James M Snell wrote:
> 
>> Hmmm... how odd.  Ok, let me explore a bit further.
>>
>> - James
>>
>> herbert wrote:
>>> Hi!
>>>
>>> I've already tried that before.
>>> Using the escape sequence \u00e4 also does *not* work.
>>>
>>> Herbert
> 
> S'all good  ---   chriswberry at gmail dot com
> 
> 
> 
> 
