couchdb-dev mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Mon, 22 Jun 2009 20:22:10 GMT
On Mon, Jun 22, 2009 at 3:32 PM, Noah Slater <nslater@apache.org> wrote:
> On Mon, Jun 22, 2009 at 03:15:24PM -0400, Paul Davis wrote:
>> I think he means optimization in so much as that the deterministic
>> revision algorithm still works regardless.
>
> For what definition of "works" though?
>
> Should two versions of the same document, one using combining code points and
> the other using single code points, generate the same hash?

I dunno. I would lean towards the revision system not caring about
unicode and not caring about canonical JSON. If there's a desire for
it, we could provide an endpoint that does normalization, in the same
style that we provide a uuids endpoint for clients with no good
source of entropy.
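
To make the question concrete, here's a small Python sketch (Python,
SHA1 and NFC are just my choices for illustration): the precomposed
and combining spellings of the same character are different bytes, so
a blind hash treats them as different documents unless somebody
normalizes first.

    import hashlib
    import unicodedata

    precomposed = "\u00e9"   # 'e-acute' as a single code point
    combining = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

    # A blind hash sees two different byte strings.
    print(hashlib.sha1(precomposed.encode("utf-8")).hexdigest())
    print(hashlib.sha1(combining.encode("utf-8")).hexdigest())

    # NFC normalization collapses them to the same code points.
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", combining))   # True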

>  If the answer is no,
> then why don't we drop the pretence of working with a canonical version of the
> document and just SHA1 the binary serialisation?
>

Right. Quoting myself from the original thread:

I.e., instead of calling it canonical JSON, just call it "The CouchDB
deterministic revision algorithm" or some such.

>> If clients want to avoid spurious conflicts, then they should send normalized
>> unicode to avoid the issue.
>
> You're arguing for using a blind binary hash like SHA1, and putting the onus
> on clients to perform canonicalisation. That's fine, as long as you realise
> that this isn't what was originally being proposed.
>
>> In other words, if we just write the algorithm to not care about normalization
>> it'll solve lots of cases for free and the cases that aren't solved can be
>> solved if the client so desires.
>
> I'm not so sure what this "solves" other than implementation effort for us.
>

Perhaps "solve" is the wrong verb. I'm saying that, in my experience,
a lot of the use cases for deterministic hashing would work correctly
without fretting about unicode normalization. As Chris points out, we
may have conflicts in cases where normalization would have prevented
them, but as it is, those are already conflicts no matter what. Also,
if a client wants to go the extra mile and provide normalized unicode,
then it can avoid even this situation, but not at the expense of
neglecting any client that is unable to provide normalized unicode
due to its environment.
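
To make the "extra mile" concrete, a client could run its document
through something like this before the PUT (a rough Python sketch,
assuming NFC; nothing here is mandated by CouchDB):

    import unicodedata

    def normalize_strings(value):
        # Recursively apply NFC to every string in the document.
        if isinstance(value, str):
            return unicodedata.normalize("NFC", value)
        if isinstance(value, list):
            return [normalize_strings(v) for v in value]
        if isinstance(value, dict):
            return {normalize_strings(k): normalize_strings(v)
                    for k, v in value.items()}
        return value

    doc = {"title": "re\u0301sume\u0301"}   # combining accents
    doc = normalize_strings(doc)            # now precomposed, NFC form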

> Again, it seems like you're wanting to offer two routes for us:
>
>  * calculate the document hash from a canonical binary serialisation
>
>  * calculate the document hash from the binary serialisation
>
> Both of these are reasonable choices, depending on our goals. But we need to
> realise that JSON canonicalisation REQUIRES Unicode canonicalisation, and so the
> choice isn't about ignoring Unicode issues, it's about deciding what a canonical
> serialisation hash buys us above and beyond a totally blind one.
>

Exactly, though I would add a third choice:

 * calculate the document hash from the deterministic binary serialization

This would include requirements like serializing document members in
some defined ordering.
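
To illustrate the distinction between "deterministic" and "canonical",
here is a rough Python sketch (not CouchDB's actual algorithm; sorted
keys, compact separators, and SHA1 are assumptions I'm making for the
example):

    import hashlib
    import json

    def deterministic_rev(doc):
        # Fixed member ordering and no insignificant whitespace,
        # but no attempt at Unicode normalization.
        body = json.dumps(doc, sort_keys=True, separators=(",", ":"))
        return hashlib.sha1(body.encode("utf-8")).hexdigest()

    # Member order no longer matters...
    print(deterministic_rev({"a": 1, "b": 2}) ==
          deterministic_rev({"b": 2, "a": 1}))             # True

    # ...but combining vs precomposed code points still differ.
    print(deterministic_rev({"x": "\u00e9"}) ==
          deterministic_rev({"x": "e\u0301"}))             # False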

On a side note, I've also contemplated just hashing the incoming
binary representation as the new revision, though that obviously
comes with its own set of issues.
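
One of those issues, as a quick sketch: hashing the raw request bytes
means even a harmless key reordering by a client's JSON library
produces a different revision.

    import hashlib

    body_a = b'{"a":1,"b":2}'
    body_b = b'{"b":2,"a":1}'   # same document, different wire bytes

    print(hashlib.sha1(body_a).hexdigest() ==
          hashlib.sha1(body_b).hexdigest())   # False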

>> The thing that worries me most about normalization is that we would end up
>> causing more problems by being complete than if we just took the naive route.
>> Requiring that a client have an implementation of normalization byte-identical
>> to the one CouchDB uses, instead of just being internally consistent, seems
>> like it could trip up a lot of clients.
>
> Interestingly enough, your suggestion to use a blind binary hash such as SHA1
> pushes the canonicalisation issues onto the client, and would force them to find
> a complete Unicode library. If CouchDB calculated the hash from a canonical
> serialisation internally, we remove this burden from the clients.
>

Time to back up a bit, I think.

For me, there are basically two main use cases for which we want
deterministic revisioning:

1. A client writes the same edit to multiple CouchDB nodes. When the
nodes replicate, no spurious conflict is introduced.
2. A client has knowledge of a document and wants to be able to
calculate new revisions client-side.

For clarity, I'll call the first example the 'safe-writes' example,
and the second the 'algorithm-implementer' example.

Obviously, regardless of what we choose to implement, safe-writes are
kosher because the server is the only one ever doing the actual
calculation. The unicode normalization issue crops up if two different
clients write the same edit to multiple hosts. If the clients don't
use the same normalization scheme, then we still introduce the
spurious conflicts (that would currently be introduced no matter
what) on replication.

For the algorithm-implementer example, if we require normalization,
then every client wishing to calculate revisions must include unicode
normalization.

What worries me is that the algorithm-implementers would end up with a
disproportionately more complicated implementation, especially when
the benefit to the safe-writes example could be slim to none, assuming
a homogeneous client implementation.

A trade-off would be to offer optional normalization in CouchDB,
whether as a flag when writing records or as a URL endpoint.
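
Purely as a sketch of what that could look like from a client's point
of view (neither the flag nor the endpoint exists today; the names
below are made up for illustration):

    import json
    import urllib.request

    doc = json.dumps({"title": "re\u0301sume\u0301"}).encode("utf-8")

    # Hypothetical flag: ask the server to normalize on write.
    req_flag = urllib.request.Request(
        "http://localhost:5984/db/mydoc?normalize=true",   # made-up flag
        data=doc, method="PUT",
        headers={"Content-Type": "application/json"})

    # Hypothetical endpoint in the spirit of /_uuids: return the
    # normalized document so the client can hash or store it itself.
    req_endpoint = urllib.request.Request(
        "http://localhost:5984/_normalize",                # made-up endpoint
        data=doc, method="POST",
        headers={"Content-Type": "application/json"})

    # urllib.request.urlopen(...) would actually send either request.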

I'm tired of typing.

> Best,
>
> --
> Noah Slater, http://tumbolia.org/nslater
>
