From: Jim Klo
To: user@couchdb.apache.org
Subject: Re: when will utf8 handling be fixed?
Date: Wed, 08 Jun 2011 12:28:03 -0700
Message-id: <3418B327-52BF-48E8-B5E4-DC27D7081989@sri.com>
References: <20110608123200.dd4dd230.mk@cognitivedissonance.ca>

One problem that often bites me - someone forgets to include the UTF-8
charset in the Content-Type header. Missing that can often mangle the
handling of high byte characters.

When setting your Content-Type with curl this is often done something
like:

curl -H "Content-Type: application/json; charset=utf-8" ....

Jim Klo
Senior Software Engineer
Center for Software Engineering
SRI International

On Jun 8, 2011, at 9:35 AM, Paul Davis wrote:

> On Wed, Jun 8, 2011 at 12:32 PM, MK <mk@cognitivedissonance.ca> wrote:
>> Is there any intention to fix couch's handling of "unusual" unicode
>> characters? One of the "unusual" characters is the right single quote
>> (226,128,153) which is a valid utf8 character and also not very
>> "unusual" IMO.
>>
>> I have an interface which allows users to add and edit text in a db
>> document (again, not very unusual) and this one came up because of
>> someone cutting and pasting some text from a source which used the
>> right single quote as an apostrophe (which is just plain common -- in
>> fact they are used in the online "Definitive Guide").
>>
>> So I am having to maintain a switch statement which filters out these
>> characters and replaces them with html entities before they get sent
>> to couch, which is okay in my case since the documents are just being
>> used as html pages anyway.
>>
>> But it's an awkward and unnecessary solution: individual
>> developers should not have to be dealing with this, proper utf8
>> handling should be hard coded into couch. For one thing, it means that
>> anyone worried about such "unusual" possibilities cannot use
>> couchapp or couch directly -- data has to be filtered first server side.
>> Although spidermonkey handles utf8 fine, depending on client side
>> filtering is not always an alternative.
>>
>> Sincerely, MK
>>
>> --
>> "Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
>> "The angel of history[...]is turned toward the past." (Walter Benjamin)
>>
>>
>
> What version of CouchDB are you using and what is an actual request look like?
>
> A recent check on trunk shows both decoders handle your case fine:
>
> 1> mochijson2:decode(<<"\"", 226,128,153, "\"">>).
> <<226,128,153>>
> 2> ejson:decode(<<"\"", 226,128,153, "\"">>).
> <<226,128,153>>
> 3>

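A small illustration of the two points made in this thread: the right single quote (U+2019) is the byte sequence 226,128,153 in UTF-8, and the charset can be stated explicitly in the Content-Type header of the request. This is only a sketch in Python using the standard library. The URL, the database name `demo`, the document id `doc1`, and the helper `build_put_request` are hypothetical names chosen for illustration, and the request is built but never sent.

```python
import json
import urllib.request

# The right single quote that triggered the thread (U+2019).
RIGHT_SINGLE_QUOTE = "\u2019"

# UTF-8 encodes U+2019 as three bytes, the same ones quoted in the thread.
utf8_bytes = RIGHT_SINGLE_QUOTE.encode("utf-8")
print(list(utf8_bytes))  # [226, 128, 153]


def build_put_request(url, doc):
    """Build (but do not send) a PUT request whose body is UTF-8 JSON.

    Hypothetical helper: it mirrors the curl -H example by stating the
    charset explicitly in the Content-Type header.
    """
    # ensure_ascii=False keeps the character as raw UTF-8 bytes in the body
    # instead of escaping it as \u2019.
    body = json.dumps(doc, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json; charset=utf-8"},
    )


# Assumed local server address and made-up database/document names.
req = build_put_request(
    "http://localhost:5984/demo/doc1",
    {"text": "couch" + RIGHT_SINGLE_QUOTE + "s"},
)
print(req.get_method(), req.full_url)
print(req.get_header("Content-type"))
```

Passing `req` to `urllib.request.urlopen` would send it. That step is left out here because it needs a running server and an existing database.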