couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Duplicate fields in documents
Date Thu, 20 Feb 2014 18:40:00 GMT
So this is a wonderful example of the JSON "spec" being rather
underspecified. Most of the time for most languages this is not an
issue. Every parser I know of use some version of a language-native
hash-map to represent JSON objects. Erlang currently has no such
native mapping (I'll discuss maps in 17.0-rc1 below). The two closest
structures commonly used in Erlang are proplists and the dict module.
For whatever reason, proplists were decided to be *the* representation
for JSON in Erlang (disregarding some controversy around the specifics
of the proplist representation).

Its interesting on the historical interpretation on this spec as well.
For instance, Jens mentioned how order preservation is non-standard
which is kinda true but really not. Key ordering is actually even
worse in terms of general implementation consensus. JavaScript
iterates over the order that keys were defined. Python has a
deterministic-yet-undefined ordering. Ruby used to have a completely
random ordering (ie, iterating twice over the same keys could yield
different orderings, I think this has changed though).

These are all details that aren't in the spec but instead depend on
the language's native representation of a JSON object. Given that
Erlang uses a proplist, following the semantics of a proplist
shouldn't be a surprise. Yes, duplicate keys are different than most
implementations but its neither prohibited or has a clear "correct"
answer especially considering the RFC's further intent that if a JSON
text conforms to the grammar that it shouldn't be rejected (something
that Jiffy willfully violates in some circumstances).

To me, the answer to the original email in this thread is "If you
don't want duplicate keys in your JSON, don't put duplicate keys in
your JSON". Given that this is the first time I've actually seen this
as a user reported error I think that guidance has served users quite
well. I'd also be interested to hear what other serializer generated
duplicate keys.

Given all of that, Erlang 17.0-rc1 has a native hash-map-like data
structure available which will most likely become the standard
representation for JSON in Erlang (I would wager). Once Jiffy returns
values as maps then the duplicate key issue will match the behavior by
only keeping the last value for each duplicate key. Making that change
is going to be interesting and present a series of challenges (ie,
rewriting large swaths of CouchDB internals).

Given all that, I don't believe its an issue. It is a bit weird when
people first encounter the behavior, but given that its proven to not
be an issue in the last five years I'm not too inclined to change it
given the downsides of adding the check.

On Thu, Feb 20, 2014 at 3:37 AM, Dave Cottlehuber <dch@jsonified.com> wrote:
>> > On Feb 19, 2014, at 6:07 AM, Dave Cottlehuber wrote:
>> >
>> > > TL;DR the appropriately named ECMA 404 JSON spec [1] is broken or more
>> > politely, insufficiently specific.
>> >
>> On Wed, Feb 19, 2014 at 8:30 PM, Jens Alfke wrote:
>> >
>> > This seems to fall into the category of "things so obvious that the people
>> > who wrote the spec didn't realize they had to mention them." I.e. "You
>> > can't have duplicate keys."
>
> When I was in France, I learned that anything not explicitly forbidden is
> therefore permitted. And while in Germany, I learned that anything not
> explicitly permitted is forbidden :-). Luckily in New Zealand, it only matters
> if you get caught ;-).
>
> On 20. Februar 2014 at 06:14:29, Mikael Söderström (vimpyboy@msn.com) wrote:
>>
>> If there are duplicate keys, it should absolutely fail.
>
> I think returning an error on receiving duplicate keys would be a sensible
> change to CouchDB, albeit a relatively minor breaking one. See below.
>
>> > > JSON is typically based on a dictionary or hash map, and there's no
>> > particular reason for that data structure to enforce uniqueness of keys.
>> >
>> > I disagree. Mathematically, a dictionary/map object is a function: it maps
>> > from a set of keys to a set of values, with each key mapping to exactly one
>> > value. (That's basically the definition of 'function'.) It's certainly
>> > possible to create implementations that map a key to _multiple_ values, but
>> > that's something different: it's a mapping from a key to a set. (For
>> > example, it's not from string-->int, it's now from string-->set.) The
>> > JSON spec does not include this kind of mapping -- an object value in JSON
>> > can be a number, but not a set of numbers.
>
> I was referring to the fact that hash tables usually have some form of collision
> detection internally, when multiple keys map to the same hash bucket. There are
> perfect hash functions & algorithms that avoid collisions, but that's getting
> a bit off track. And it's moot point, as we all agree that duplicate keys are not
> what most people expect, including Crockford.
>
>> > IMHO the reasonable thing for a JSON parser to do if it encounters a
>> > duplicate key is to fail with a clear error. Failing that, the only other
>> > reasonable option is to discard one or the other value (I don't have an
>> > opinion which.) But keeping both is unreasonable.
>
> If somebody wants to sort this out, I'd suggest implementing the fix (in C) in
> Paul Davis' jiffy library, which has been on the list of things to import
> for a while.
>
> I added https://github.com/davisp/jiffy/issues/54 and updated
> https://issues.apache.org/jira/browse/COUCHDB-1294
>
> jiffy:decode(<< "{\"foo\":\"bar\", \"foo\":\"bar\"}" >>) should return an
error,
> something like invalid_json_duplicate_name_in_object.
>
> CC'd dev@.
>
> A+
> --
> Dave Cottlehuber
> Sent from my PDP11
>
>

Mime
View raw message