incubator-couchdb-user mailing list archives

From "Paul Davis" <paul.joseph.da...@gmail.com>
Subject Re: Can I guarantee uniqueness in a field without using _id?
Date Tue, 13 Jan 2009 04:44:34 GMT
On Mon, Jan 12, 2009 at 11:20 PM, Sunny Hirai <thesunny@gmail.com> wrote:
> Thanks for all the responses.
>
> Just to clarify, I know that CouchDB is not relational and I know the
> primary differences and limitations; however, I still have some questions if
> you will permit.
>

We permit questions. :) Especially well thought out ones like you have below.

> While it could be noted as a weakness of my implementation, the other thing
> to note is that _id can then no longer be used generically. For example, I
> can not include a reference from another SQL database unless I make the
> reference a very long String encoded with all the unique values which seems
> to be a bad way to relate tables/databases.
>

Hence the MD5 suggestion. It's a length-limited string that
(probabilistically, like UUIDs) guarantees global identity.

> Note that there is no way to handle two unique fields (e.g. "name" and
> "email" both unique).
>

Well, taking the MD5 of their string representation would be fine.
(Assumptions are obvious)
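
A minimal sketch of the MD5 idea, assuming Python and a hypothetical
helper name (doc_id_for); nothing here is CouchDB API. It hashes a
canonical string form of the values and uses the hex digest as the
_id. Note that this makes the combination of values unique; it does
not by itself make "name" and "email" each unique on their own.

```python
import hashlib
import json


def doc_id_for(*unique_values):
    """Derive a fixed-length document ID from values that must be unique together.

    Hypothetical helper for illustration; the name and the JSON encoding
    are assumptions, not anything CouchDB prescribes.
    """
    # JSON-encode the list so ("ab", "c") and ("a", "bc") produce
    # different strings, which a plain join would not.
    canonical = json.dumps(list(unique_values), ensure_ascii=False,
                           separators=(",", ":"))
    # The MD5 hex digest is always 32 characters, whatever the input length.
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Example values are made up.
user_id = doc_id_for("sunny", "sunny@example.com")
```

The same inputs always give the same ID, so a second attempt to
create a document for the same pair targets the same _id.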

> I know that CouchDB has different pros and cons from relational databases
> and I'm okay with there being cons. I just want to make sure that what I'm
> asking is (a) impossible to do because of the way CouchDB works or (b) a
> design choice that is not constrained by CouchDB.
>
> The reason I ask is that there appears to be some sort of a lock somewhere
> to assure that you don't end up with two documents with the same id. For
> example, if you PUT two new documents at exactly the same time in the same
> server, one of them will fail because it will not have a "_rev". This is
> actually an assertion that I'm not sure is true.
>

Let me start with the fact that this is an assertion that is mostly
true and only became slightly untrue in the last week or so. More on
this below.

> I understand that two documents PUT into two different servers that
> replicate the same database can conflict; however, can two of the documents
> conflict in the same database?
>
> If they CAN conflict, then the guaranteed uniqueness on a single server is
> not actually guaranteed upon a successful insert. They can both "succeed"
> but they can be in conflict. This could be bad for doing things like
> creating a new user as two users can be granted the same name successfully
> but only one actually gets it. In this case, the solutions to use _id to
> guarantee a unique name can actually fail anyways, even though it may be
> rare.
>

You're hitting on a part of the implementation that probably solely
resides in Damien's head at this moment. I haven't seen the end
implementation so I'll only be able to give you my best guess on what
will happen in the coming days/weeks/months.

> On the other hand, if _id CANNOT conflict within the same server, then it
> appears there is some sort of lock somewhere. It might be very light, or
> small, or whatever, but then there is a lock.
>

It's optional now: sending the "X-Couch-Full-Commit: true" header
will ensure a full commit.

> So, in other words, I would like to know which one is true:
>
> A. there can be _id conflicts on the same server. In that case,
> _id doesn't guarantee uniqueness in the sense that two records can be
> inserted successfully, but only one is authoritative. Then I have to deal
> with this somehow anyways.
>
> B. there aren't conflicts on the same server so you are guaranteed
> uniqueness on the same server. The _id hack always works. In this case could
> we not consider a similar situation to guarantee unique fields, perhaps in
> the far (far) future? Even if not, I'd like to know that there can be no
> conflicts on the same server.
>
> C. Something else completely that allows both a conflict-free _id in a
> manner that is simultaneously lock free that I haven't thought of.
>

This is all from memory without reading or using any of this new code
yet, but the situation is something like the following. Remember, I'm
not entirely certain on all these things, it's 11:33, and I've had
beers. Please no pointing and laughing.

Briefly:

Old school style: Single node couchdb ensured global uniqueness when
using PUT. When using POST to _bulk_docs there were transactional
semantics, if one of the docs failed all failed.

New school style: Giving transactional semantics on _bulk_docs is
inefficient when contemplating multi-node setups. A CouchDB
multi-node setup refers to having Couch transparently and
automagically hash documents and distribute them accordingly.

Uncertain style: Damien committed code to make the transaction
semantics optional using a header for the request. This was presented
in terms of _bulk_docs. I have no idea how it affects PUT semantics on
a single node or otherwise.

Certainly muddy waters uncertain style: Given that I have no idea on
the specifics of the header flag, if it's specified then I would be
running under the assumption that you will get a notification that
something at least conflicted or it might fail the request.

Moving on...

So the idea is that you're either going to wait for a possibly super
long time for a transaction, or write code that deals with conflicts.
The recommendation is that you write code that deals with conflicts.
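
A rough sketch of what "code that deals with conflicts" might look
like for the unique-name case, in Python. The function names and the
put callable are hypothetical, and it assumes the server answers 201
when the document is created and 409 when the _id is already taken
(older releases may use a different status code). The in-memory fake
only stands in for a single node so the sketch runs without a server.

```python
def create_unique(put, doc_id, doc):
    """Try to create doc under doc_id; True if we got it, False if it was taken.

    `put` is a hypothetical callable (doc_id, doc) -> HTTP status code,
    e.g. a thin wrapper around an HTTP PUT of the document.
    """
    status = put(doc_id, doc)
    if status == 201:   # assumed: document created
        return True
    if status == 409:   # assumed: someone else already holds this _id
        return False
    raise RuntimeError("unexpected status %d" % status)


def make_fake_put():
    """In-memory stand-in for a single-node database, for illustration only."""
    store = {}

    def put(doc_id, doc):
        if doc_id in store:
            return 409
        store[doc_id] = doc
        return 201

    return put


put = make_fake_put()
first = create_unique(put, "sunny", {"type": "user"})   # claims the name
second = create_unique(put, "sunny", {"type": "user"})  # name already taken
```

The caller then decides what a False means: ask the user for another
name, retry with a different _id, and so on.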

I'm sure I futzed something in there, so wait for corrections before
you come to any grand conclusions ;)

HTH,
Paul Davis

> Thanks for the feedback.
>
> Sunny
>
