couchdb-user mailing list archives

From Noah Diewald <noah.diew...@gmail.com>
Subject Re: CouchDB View Unicode Document
Date Thu, 28 Apr 2011 22:37:00 GMT
On Thu, Apr 28, 2011 at 5:19 PM, Paul Davis <paul.joseph.davis@gmail.com> wrote:
> On Thu, Apr 28, 2011 at 5:57 PM, Noah Diewald <noah.diewald@gmail.com> wrote:
>>> Can someone paste some actual input/output pairs so I have a clue
>>> what's going on.
>>>
>>> Theoretically \uFFFF isn't a valid escape sequence last I checked
>>> (don't get me started on 4627 idiocy).
>>>
>>> The JSON encoder will by default escape data that is non-printable
>>> ascii. The few special cased characters mentioned in the JSON spec are
>>> backslash escaped (\t \n \" etc) while all other bits are escaped as
>>> \uHHHH sequences.
>>
>> What you're describing is what I'm seeing. I don't think it is a bug,
>> just something I don't like because it isn't taking advantage of the
>> benefits of unicode. I'd rather see the characters instead of \uHHHH
>> sequences. For instance I get "\u00e9" for "é". I guess the JSON spec
>> says that any character can be escaped but characters in the basic
>> multilingual plane don't need to be because the string is utf8. I
>> guess I feel that the benefit of utf8 is supposed to be that escaping
>> these characters isn't necessary but that they'll appear in an easily
>> human readable form. I think from what you said above that I'm not
>> experiencing anything that is unexpected but I can supply some input
>> and output if it is.
>>
>> --
>> Noah Diewald
>> noah.diewald.me
>> noahsarchive.net
>>
>
> You are exactly correct. I think the general reason for escaping UTF-8
> is to make it easier for the JSON to pass through broken
> implementations that don't pay attention to possible UTF-8 in string
> data. It's possible to make that sort of thing configurable, but
> that would entail quite a bit of consideration on a couple of different
> fronts.
>

Yes, that makes sense. We do not live in a perfect world. It would be
cool if sending "Accept-Charset: utf-8" altered the behavior and let
the characters through unescaped, but I can see how this wouldn't be
a high priority, since the current behavior is simple and works for
everyone.
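
For illustration only, here is a minimal sketch of the two output styles
discussed above. It uses Python's standard json module rather than
CouchDB's own encoder, so it only approximates the behavior in question.
The function encode_body and its Accept-Charset policy are hypothetical;
nothing in this thread says CouchDB implements them.

```python
import json

doc = {"name": "\u00e9"}  # the character "é" (U+00E9)

# Default behavior of Python's json module: non-ASCII characters are
# written as \uHHHH escapes, similar to the output described above.
escaped = json.dumps(doc)

# With ensure_ascii=False the character is emitted as-is; the result
# must then be encoded as UTF-8 (or another Unicode encoding) on the wire.
unescaped = json.dumps(doc, ensure_ascii=False)


def encode_body(doc, accept_charset=None):
    """Hypothetical policy: skip escaping only if the client lists utf-8."""
    charsets = [
        part.split(";")[0].strip().lower()
        for part in (accept_charset or "").split(",")
    ]
    return json.dumps(doc, ensure_ascii="utf-8" not in charsets)


print(escaped)
print(unescaped)
print(encode_body(doc, "utf-8"))

# Both forms decode to the same document, so the choice is cosmetic
# for any conforming JSON parser.
assert json.loads(escaped) == json.loads(unescaped) == doc
```

Either form round-trips to the same string, which is why the escaped
output is safe but harder to read by eye.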

-- 
Noah Diewald
noah.diewald.me
noahsarchive.net
