Message-ID: <46DF04AC.1040303@gmail.com>
Date: Wed, 05 Sep 2007 12:34:04 -0700
From: James M Snell
To: abdera-user@incubator.apache.org
Subject: Re: Invalid byte 2 of 3-byte UTF-8 sequence.

Oh, and if you are able to put together a patch file, please post it to
JIRA. :-)

- James

Chris Berry wrote:
> Greetings,
>
> We figured it out. AFAICT, both my issue and Herbert's are the same.
> I believe this is a bug in Abdera.
>
> There are actually two issues:
>
> -----------------------
> First, Abdera uses HttpClient's
>
>     method.getResponseBodyAsStream();
>
> in order to obtain a raw stream of bytes for Woodstox (which is the
> correct thing to do for performance).
>
> But Woodstox does NOT assume UTF-8.
> So it fails when parsing valid UTF-8 characters.
>
> The fix is to change the following line in AbstractClientResponse
>
>     public Document getDocument(Parser parser, ParserOptions options)
>             throws ParseException {
>         try {
>             .......
>             // Document doc = parser.parse(getInputStream(), base, options);
>             Document doc = parser.parse(getReader(), base, options);
>             ....
>
> And to add the following method to AbstractClientResponse
>
>     public java.io.Reader getReader() throws java.io.IOException {
>         String header = getHeader("Content-Type");
>
>         String type = "UTF-8"; // default to UTF-8
>         if (header != null) { // guard in case the header is absent
>             java.util.regex.Matcher matcher =
>                 java.util.regex.Pattern.compile(".*charset\\s*\\=\\s*(\\S+).*").matcher(header);
>
>             if (matcher.matches()) {
>                 type = matcher.group(1);
>             }
>         }
>
>         return new java.io.InputStreamReader(getInputStream(), type);
>     }
>
> Although, there is likely a cleaner way to get the "charset" param in
> Abdera??
>
> -----------------------------
> Second, Abdera is NOT adding the "charset" parameter (e.g.
> ";charset=utf-8") to the Content-Type HTTP header of the response.
>
> So a fix might be to change the following line in BaseResponseContext:
>
>     public BaseResponseContext(T base, boolean chunked) {
>         this.base = base;
>         setStatus(200);
>         setStatusText("OK");
>         this.chunked = chunked;
>         try {
>             // setContentType(getContentType().toString());
>             setContentType(getContentType().toString() + "; charset=utf-8");
>         } catch (Exception e) {}
>     }
>
> Although there are likely better ways/places to accomplish this within
> Abdera.
> Perhaps I need to set this in my SpringAbderaServlet??
>
> I will add these details to the JIRA as well.
> Thanks,
> -- Chris
>
> On Sep 5, 2007, at 11:53 AM, James M Snell wrote:
>
>> Hmmm... how odd. Ok, let me explore a bit further.
>>
>> - James
>>
>> herbert wrote:
>>> Hi!
>>>
>>> I've already tried that before.
>>> Using the escape sequence \u00e4 also does *not* work.
>>>
>>> Herbert
>
> S'all good --- chriswberry at gmail dot com
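
For anyone following the thread, here is a minimal standalone sketch of the two ideas in Chris's message: pick the charset out of a Content-Type value (defaulting to UTF-8) before wrapping the response bytes in a Reader, and append a charset parameter to a content type that lacks one. The class and method names (ContentTypeCharset, charsetOf, readerFor, withCharset) are made up for illustration and are not part of Abdera or HttpClient; it only uses the JDK. It also assumes the charset value may be quoted or followed by further parameters, which a plain `(\S+)` capture would not strip.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an Abdera class.
final class ContentTypeCharset {

    // Finds a "charset" parameter at the start of the value or after a ';'.
    // The value may be quoted; it ends at whitespace, ';' or a quote.
    private static final Pattern CHARSET = Pattern.compile(
        "(?:^|;)\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?",
        Pattern.CASE_INSENSITIVE);

    private ContentTypeCharset() {}

    // Returns the charset named in the Content-Type value, or UTF-8 when the
    // header is null, has no charset parameter, or names an unknown charset.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: fall back below.
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Wraps the raw response bytes in a Reader that decodes with the
    // charset derived from the Content-Type value.
    static Reader readerFor(InputStream body, String contentType) {
        return new InputStreamReader(body, charsetOf(contentType));
    }

    // Appends "; charset=<charset>" unless the value already carries one.
    static String withCharset(String contentType, String charset) {
        if (CHARSET.matcher(contentType).find()) {
            return contentType;
        }
        return contentType + "; charset=" + charset;
    }
}
```

Under these assumptions, readerFor(stream, header) plays the role of the proposed getReader(), and withCharset(type, "utf-8") the role of the string concatenation in the constructor, with the difference that an existing charset parameter is left alone rather than duplicated.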