One problem that often bites me: someone forgets to include the UTF-8 charset in the Content-Type header. Without it, high-byte (non-ASCII) characters can get mangled somewhere along the way.

With curl, the charset is usually set along with the Content-Type, something like:

curl -H "Content-Type: application/json; charset=utf-8" .... 
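
To show the kind of mangling I mean, here is a minimal Python sketch (not CouchDB or curl code). It assumes the receiving side falls back to Windows-1252 when no charset is declared; that fallback is only an assumption for illustration, and real software may guess differently.

```python
# Illustrative only: decoding the same bytes with the right and wrong charset.
text = "it\u2019s"            # contains U+2019 RIGHT SINGLE QUOTATION MARK
raw = text.encode("utf-8")    # the bytes that actually go over the wire

# In UTF-8, U+2019 is the three-byte sequence 226, 128, 153.
quote_bytes = list(raw[2:5])

# Receiver that knows the body is UTF-8 gets the original text back.
decoded_ok = raw.decode("utf-8")

# Assumed failure mode: receiver guesses Windows-1252 instead of UTF-8,
# so each of the three bytes becomes its own (wrong) character.
decoded_bad = raw.decode("cp1252")

print(quote_bytes)
print(decoded_ok == text)
print(decoded_bad)
```

The bytes never change here; only the interpretation does, which is why declaring the charset explicitly tends to avoid the problem.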

Jim Klo
Senior Software Engineer
Center for Software Engineering
SRI International




On Jun 8, 2011, at 9:35 AM, Paul Davis wrote:

On Wed, Jun 8, 2011 at 12:32 PM, MK <mk@cognitivedissonance.ca> wrote:
Is there any intention to fix couch's handling of "unusual" unicode
characters?  One of the "unusual" characters is the right single quote
(U+2019, UTF-8 bytes 226,128,153), which is valid UTF-8 and also not very
"unusual" IMO.

I have an interface which allows users to add and edit text in a db
document (again, not very unusual) and this one came up because of
someone cutting and pasting some text from a source which used the
right single quote as an apostrophe (which is just plain common -- in
fact they are used in the online "Definitive Guide").

So I am having to maintain a switch statement which filters out these
characters and replaces them with html entities before they get sent
to couch, which is okay in my case since the documents are just being
used as html pages anyway.

But it's an awkward and unnecessary solution: individual
developers should not have to be dealing with this, proper utf8
handling should be hard coded into couch. For one thing, it means that
anyone worried about such "unusual" possibilities cannot use
couchapp or couch directly -- data has to be filtered first server side.
Although spidermonkey handles utf8 fine, depending on client side
filtering is not always an alternative.

Sincerely, MK

--
"Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
"The angel of history[...]is turned toward the past." (Walter Benjamin)



What version of CouchDB are you using, and what does an actual request look like?

A recent check on trunk shows both decoders handle your case fine:

1> mochijson2:decode(<<"\"", 226,128,153, "\"">>).
<<226,128,153>>
2> ejson:decode(<<"\"", 226,128,153, "\"">>).
<<226,128,153>>
3>
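
The same round trip can be reproduced outside Erlang. This is a rough sketch using Python's standard json module, which is only an analogue and not one of CouchDB's decoders; it just confirms that a JSON string holding those three bytes is valid UTF-8 and survives decoding.

```python
import json

# JSON document: a quoted string containing the bytes 226,128,153.
payload = b'"' + bytes([226, 128, 153]) + b'"'

# json.loads accepts UTF-8 encoded bytes and returns a str.
decoded = json.loads(payload)

# Re-encode to check that the original byte sequence comes back.
round_trip = list(decoded.encode("utf-8"))

print(repr(decoded))
print(round_trip)
```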