Message-ID: <956176654.1251525392746.JavaMail.jira@brutus>
Date: Fri, 28 Aug 2009 22:56:32 -0700 (PDT)
From: "Curt Arnold (JIRA)"
To: dev@couchdb.apache.org
Reply-To: dev@couchdb.apache.org
Subject: [jira] Commented: (COUCHDB-345) "High ASCII" can be inserted into db but not retrieved
In-Reply-To: <944783679.1241648130398.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/COUCHDB-345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749083#action_12749083 ]

Curt Arnold
commented on COUCHDB-345:
-------------------------------------

ISO-8859-1, Cp1252 and Latin-1 are near synonyms for encoding the first 256 character points in Unicode as single-byte values, and they are incapable of representing any other character without some escape mechanism. Any arbitrary sequence of bytes is a valid ISO-8859-1 sequence and can be decoded into a sequence of Unicode characters.

UTF-8 is a variable-length encoding of the full Unicode character repertoire. Character values from \u0000 to \u007F are represented as a single byte, while other characters require 2-4 bytes to encode (up to 6 in the original definition). Unlike ISO-8859-1, not every sequence of bytes is valid and can be converted back to Unicode character points. If I remember correctly, a byte with the high bit set is only valid as part of a well-formed multi-byte sequence: a lead byte followed by the right number of continuation bytes of the form 10xxxxxx. The test data in the last two cases are valid ISO-8859-1 sequences, but they cannot be interpreted as UTF-8 since they contain byte sequences that cannot be converted back into Unicode code points.

If it were just an encoding mismatch and the data were being misinterpreted, you would lay the blame on the client. However, in this case, data can go into the database that the rest of the stack can't process since it contains invalid sequences.

The RFC mentions the two variants of UTF-16 and UCS-4; however, the ISO-8859-1 sequences could not be interpreted using any of those encodings, since the first two characters must be ASCII. There are only certain sequences of bytes that could appear at the start of JSON encoded in any of those encodings, and the byte sequences sent in the last two cases don't match any of those patterns. Sniffing the encoding would work in a similar manner to XML, as described in http://www.w3.org/TR/REC-xml/#sec-guessing.

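As a rough illustration of the two points above, here is a minimal Python sketch. It is not CouchDB code: the helper names `decodes_as` and `sniff_json_encoding` and the sample bytes are hypothetical, chosen only for this example. The sniffing follows the null-byte pattern idea from the JSON RFC (which relies on the first two characters being ASCII), and the final fallback to UTF-8 still has to be confirmed by a strict decode.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


def sniff_json_encoding(data: bytes) -> str:
    """Guess a JSON text's encoding from the nulls in its first four bytes.

    Assumes the first two characters of the JSON text are ASCII.
    """
    head = data[:4]
    if len(head) == 4:
        nulls = tuple(b == 0 for b in head)
        if nulls == (True, True, True, False):
            return "utf-32-be"   # 00 00 00 xx
        if nulls == (False, True, True, True):
            return "utf-32-le"   # xx 00 00 00
        if nulls == (True, False, True, False):
            return "utf-16-be"   # 00 xx 00 xx
        if nulls == (False, True, False, True):
            return "utf-16-le"   # xx 00 xx 00
    return "utf-8"               # no null pattern: assume UTF-8


# Hypothetical request body: 0xD8 is "Ø" in ISO-8859-1, but in UTF-8 it is a
# lead byte that must be followed by a continuation byte, and "l" is not one.
body = b'{"word": "\xd8l"}'

print(decodes_as(body, "iso-8859-1"))  # valid as ISO-8859-1
print(sniff_json_encoding(body))       # no nulls, so the guess is UTF-8
print(decodes_as(body, "utf-8"))       # strict UTF-8 decoding rejects it
```

A server applying this kind of check at PUT/POST time could reject the body before it is stored, rather than failing later on retrieval.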
> "High ASCII" can be inserted into db but not retrieved
> ------------------------------------------------------
>
>                 Key: COUCHDB-345
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-345
>             Project: CouchDB
>          Issue Type: Bug
>    Affects Versions: 0.9
>         Environment: OSX 10.5.6
>            Reporter: Joan Touzet
>         Attachments: badtext.tar.gz, enctest.zip
>
>
> It is possible to PUT/POST a document into CouchDB with a "high ASCII" value that cannot be retrieved. This results from not escaping a non-ASCII value into \u#### when PUT/POSTing the document.
> The attached sample code will recreate the problem using the hex value D8 (Ø) in a possibly unsavoury test string.
> Sample output against 0.9.0 is as follows:
> ================================================
> {
>     "ok": true
> }
> {
>     "id": "fail",
>     "ok": true,
>     "rev": "1-76726372"
> }
> {
>     "error": "ucs",
>     "reason": "{bad_utf8_character_code}"
> }
> ================================================
> Please note this defect turned up another problem, namely that the bad_utf8_character_code exception thrown by a design document attempting to map() the bad document caused Futon to fail silently in building the view, with no indication (except via debug log) that there was a failure. The log indicated two attempts to build the view, both failing, followed by an uncaught exception error for Futon.
> Based on this, there are likely other areas in the codebase that do not handle the bad_utf8_character_code exception correctly.
> My belief is that CouchDB shouldn't accept this input and should have rejected the PUT/POST, or should have escaped the input itself before the insertion.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
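
For reference, the two options raised at the end of the issue description (reject the input, or escape non-ASCII values into \u#### before insertion) can be sketched in a few lines of Python. This is only an illustration of the idea, not how CouchDB handles it; the function name `escape_non_ascii` is hypothetical, and the sketch assumes the request body is meant to be UTF-8.

```python
import json


def escape_non_ascii(body: bytes) -> str:
    """Decode a UTF-8 JSON body and re-serialize it as pure ASCII.

    Raises UnicodeDecodeError if the body is not valid UTF-8, which is
    the point at which a server could reject the PUT/POST.
    """
    doc = json.loads(body.decode("utf-8"))
    # ensure_ascii=True writes every non-ASCII character as a \uXXXX escape
    return json.dumps(doc, ensure_ascii=True)


good = '{"word": "\u00d8"}'.encode("utf-8")  # "Ø" sent as valid UTF-8 (C3 98)
print(escape_non_ascii(good))

bad = b'{"word": "\xd8"}'                    # raw 0xD8, as in the report
try:
    escape_non_ascii(bad)
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```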