couchdb-user mailing list archives

From MK ...@cognitivedissonance.ca>
Subject Re: when will utf8 handling be fixed?
Date Wed, 08 Jun 2011 20:20:52 GMT
On Wed, 8 Jun 2011 12:35:57 -0400
Paul Davis <paul.joseph.davis@gmail.com> wrote:
> On Wed, Jun 8, 2011 at 12:32 PM, MK <mk@cognitivedissonance.ca> wrote:
> > Is there any intention to fix couch's handling of "unusual" unicode
> > characters?  One of the "unusual" characters is the right single
> > quote (226,128,153) which is a valid utf8 character and also not
> > very "unusual" IMO.

> What version of CouchDB are you using and what does an actual request
> look like?

1.0.2 built a few weeks ago.   

I tried to reproduce this with a plain curl PUT and a copy of the
request dumped from node, and that works okay.  I.e., yes, couch deals
with the multi-byte character, and it shows up in the stdout csv decimal
dump.

So I took the csv decimal dump from couch in debug mode, turned it back
into bytes, and diff'd it with the request.
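
The decode-and-diff step described above can be sketched roughly as follows. This is only an illustration: the exact layout of couch's debug dump is not shown in this message, so the comma-separated decimal format, the sample payload, and the helper names `dumpToBuffer` and `firstDifference` are all assumptions.

```javascript
// Hypothetical helper: turn a comma-separated decimal dump such as
// "123,34,113" back into raw bytes. The dump format is an assumption.
function dumpToBuffer(csvDump) {
  const bytes = csvDump
    .split(",")
    .map((field) => field.trim())
    .filter((field) => field.length > 0)
    .map((field) => Number.parseInt(field, 10));
  return Buffer.from(bytes);
}

// Return the index of the first differing byte, or -1 if the buffers match.
// A pure length mismatch is reported at the end of the shorter buffer.
function firstDifference(a, b) {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    if (a[i] !== b[i]) return i;
  }
  return a.length === b.length ? -1 : shared;
}

// Made-up request body containing U+2019, and a made-up dump that stops
// two bytes early, mimicking the symptom described in this message.
const sent = Buffer.from('{"q":"\u2019"}', "utf8");
const logged = dumpToBuffer("123,34,113,34,58,34,226,128,153");

const diffAt = firstDifference(sent, logged);
const missing = sent.subarray(diffAt);
console.log("first difference at byte", diffAt, "missing:", missing.toString("utf8"));
```

With real data, `sent` would be the request captured from node and `logged` the bytes recovered from couch's dump.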

The difference: the last couple of bytes of the request, including the
closing }, are missing from the couch csv dump, which would make the
json invalid.  Otherwise it is identical to the curl request, which goes
through.

Watching the transfer in wireshark, however, I can see that couch does
receive those last few bytes, so *it was not truncated by me or node*.

Go figure.

> A recent check on trunk shows both decoders handle your case fine:

I have no idea what decoders you are referring to.   Anyway, for
posterity, here's the issue:

- Client sends utf8 data to node.
- Node passes data on to couch via http (Content-type is
application/x-www-form-urlencoded, identical to that used by curl).
- Couch rejects data containing a multi-byte character; its csv decimal
dump is missing bytes that were in the transmission.
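
One thing that may be worth checking, and this is a guess that nothing in this thread confirms, is whether the Content-Length header sent by node is computed from the string's character count rather than its UTF-8 byte count. The sketch below only shows how the two counts diverge for a body containing U+2019; the sample body is made up.

```javascript
// Made-up JSON body containing the right single quote (U+2019).
const body = '{"quote":"\u2019"}';

// Number of UTF-16 code units in the string (what String.length reports).
const charCount = body.length;

// Number of bytes once the string is encoded as UTF-8 for the wire.
const byteCount = Buffer.byteLength(body, "utf8");

// The UTF-8 encoding of U+2019 on its own.
const quoteBytes = Array.from(Buffer.from("\u2019", "utf8"));

console.log({ charCount, byteCount, quoteBytes });
```

Here the byte count exceeds the character count by two. If a Content-Length header carried the smaller number, a receiver honouring it would stop two bytes short of the end of the body, even though all the bytes arrive on the wire.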

But even to me this sounds dubious, considering an identical request
from curl is fine.  All I can say is that what makes the difference is a
switch statement in node containing this:

case "\u2019": rv += "&rsquo;"; 

That's the last thing I do before the PUT.  If I leave the multi-byte
character in, the problem occurs.
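
For context, a minimal sketch of what such a switch might look like in full. Only the single `case` line above comes from the original code; the surrounding loop, the `default` branch, and the function name `escapeRightQuote` are hypothetical.

```javascript
// Hypothetical reconstruction: replace U+2019 with an HTML entity and
// pass every other character through unchanged.
function escapeRightQuote(text) {
  let rv = "";
  for (const ch of text) {
    switch (ch) {
      case "\u2019":
        rv += "&rsquo;";
        break;
      default:
        rv += ch;
    }
  }
  return rv;
}

const escaped = escapeRightQuote("it\u2019s");
console.log(escaped, escaped.length, Buffer.byteLength(escaped, "utf8"));
```

After the substitution the string is pure ASCII, so its character count and UTF-8 byte count are equal.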

MK

-- 
"Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
"The angel of history[...]is turned toward the past." (Walter Benjamin)

