couchdb-user mailing list archives

From "Mark J. Reed" <markjr...@gmail.com>
Subject Re: when will utf8 handling be fixed?
Date Wed, 08 Jun 2011 20:26:52 GMT
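The Content-Length is a count of bytes, not characters.  Sounds like your
client is sending a character count instead.  The right single quote is
one character but three bytes in UTF-8, so each one would leave the
declared length two bytes short, which fits the missing trailing bytes.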
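
A rough Node sketch of the mismatch, for illustration only.  The sample
body is made up (it is not MK's actual request), and the truncation step
only assumes a server that stops reading at the declared length:

```javascript
// Hypothetical JSON body containing one right single quote (U+2019).
const body = JSON.stringify({ text: "it\u2019s" });

// String length counts UTF-16 code units, not bytes on the wire.
const charCount = body.length;

// This is the value Content-Length should carry.
const byteCount = Buffer.byteLength(body, "utf8");

// UTF-8 encoding of U+2019, as decimal bytes: 226, 128, 153.
const quoteBytes = Array.from(Buffer.from("\u2019", "utf8"));

// What a reader sees if it stops after charCount bytes.
const truncated = Buffer.from(body, "utf8")
  .subarray(0, charCount)
  .toString("utf8");

console.log(charCount, byteCount); // the two counts differ by 2
console.log(quoteBytes);
console.log(truncated);            // the closing quote and } are gone
```

If that is what is happening, passing the result of Buffer.byteLength as
the Content-Length (or sending a Buffer and letting node compute it)
should be enough.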

On Wednesday, June 8, 2011, MK <mk@cognitivedissonance.ca> wrote:
> On Wed, 8 Jun 2011 12:35:57 -0400
> Paul Davis <paul.joseph.davis@gmail.com> wrote:
>> On Wed, Jun 8, 2011 at 12:32 PM, MK <mk@cognitivedissonance.ca> wrote:
>> > Is there any intention to fix couch's handling of "unusual" unicode
>> > characters?  One of the "unusual" characters is the right single
>> > quote (226,128,153) which is a valid utf8 character and also not
>> > very "unusual" IMO.
>
>> What version of CouchDB are you using and what does an actual request
>> look like?
>
> 1.0.2 built a few weeks ago.
>
> I tried to replicate this simply using curl PUT and a copy of the
> request dumped from node, that works okay.  Ie, yep, couch deals with
> the multi-byte, and it is in the stdout csv decimal dump.
>
> So I took the csv decimal dump from couch in debug mode, turned it back
> into bytes, and diff'd it with the request.
>
> The difference: the last couple of bytes are not in the couch csv dump,
> such as the closing }, which would make the json invalid.  Otherwise it
> is identical to the curl request, which goes through.
>
> Watching the transfer on wireshark, however, couch does receive those
> last few bytes, so *it was not truncated by me or node*.
>
> Go figure.
>
>> A recent check on trunk shows both decoders handle your case fine:
>
> I have no idea what decoders you are referring to.   Anyway, for
> posterity, here's the issue:
>
> - Client sends utf8 data to node.
> - Node passes data on to couch via http (Content-type is
> application/x-www-form-urlencoded, identical to that used by curl).
> - Couch rejects data with multi-byte character, csv decimal dump is
> missing bytes that were in the transmission.
>
> But even to me this sounds dubious, considering an identical request
> from curl is fine...all I can say is that what makes a difference is a
> switch with this in node:
>
> case "\u2019": rv += "&rsquo;";
>
> That's the last thing I do before the PUT.  If I leave the multi-byte
> in, there's an issue.
>
> MK
>
> --
> "Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
> "The angel of history[...]is turned toward the past." (Walter Benjamin)
>
>

-- 
Mark J. Reed <markjreed@gmail.com>
