couchdb-dev mailing list archives

From Suraj Kumar <suraj.ku...@inmobi.com>
Subject Modeling Relationships and providing Transactional Integrity
Date Thu, 10 Apr 2014 12:54:18 GMT
[warning: cross-posted]

Hi,

We're attempting to build a model of a large-scale, complex infrastructure.
That means every machine and its supporting machines report to a mothership.
Since our problem is truly one of high concurrency, choosing a solid
database to keep the state of this model became our focus early on. We
zeroed in on CouchDB, mainly because it is powered by Erlang, which lets us
build the things we need that CouchDB itself doesn't provide. One of those
things is the notion of Relationships.

What do I mean by "Relationships" really? Some "types" of Entities have
attributes which may be related to some other "types" of Entities in
specific known ways (1:1, 1:*, *:1).

The "Type" becomes the hazy part for schemaless systems like CouchDB.
However, let us now talk in Couch primitives.

Let us set aside the question of how this could still result in
inconsistency in a live distributed database... and imagine that there could
be 'design' documents that describe how some attributes of some "types" of
documents are related to some other attributes of some other "types" of
documents. Imagine that a new 'Relationships' engine could use these to
automatically validate and maintain the relational integrity of the
database. In Couch terminology, it is a way to automatically modify certain
keys of related documents whenever certain keys of a given 'type' of
document change.

I'm now attempting to formally describe two of the basic primitive elements
of every practical schemaless database system, specifically CouchDB:

1. Documents of classifiable 'types' or 'sets'.
2. Attributes (*keys of the JSON hash*), and a way to address attributes
using a generic, intuitive and standard "*convention*"
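
To make these two primitives concrete, here is a minimal sketch in plain
JavaScript. The helper names (getPath, classifiers, classify) and the two
example types are hypothetical, chosen only for illustration; nothing here
is an existing CouchDB API.

```javascript
// Hypothetical helper: resolve a dotted path such as "os.version" against a
// document. Returns undefined when any segment is missing. Keys that
// themselves contain "." cannot be addressed with this convention.
function getPath(doc, path) {
  return path.split(".").reduce(function (value, key) {
    return (value !== null && typeof value === "object") ? value[key] : undefined;
  }, doc);
}

// Hypothetical classifiers: each named "type" is a predicate over a document.
var classifiers = {
  Machine: function (doc) {
    return getPath(doc, "os.version") !== undefined;
  },
  AuditRecord: function (doc) {
    return getPath(doc, "last_modified.by.user.id") !== undefined;
  }
};

// A document may satisfy several classifiers, so the result is a list of
// type names rather than a single type.
function classify(doc) {
  return Object.keys(classifiers).filter(function (name) {
    return classifiers[name](doc) === true;
  });
}

var doc = { _id: "host-1", os: { name: "linux", version: "3.2" } };
console.log(getPath(doc, "os.version")); // prints 3.2
console.log(classify(doc));              // only "Machine" matches
```

Returning a list of type names is a design choice of this sketch: with
predicate-based typing nothing prevents one document from matching more
than one classifier.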

I am of the belief that defining these two formally is the first step to
approach implementing Relationships in CouchDB as a usable general purpose
optional feature (for those who are willing to compromise some things in
return :) ).

Some more thoughts:


   - "types" in a schemaless JSON data structure can only be determined by
   a function. Hence, there should be 'type'-determining functions, or
   classifiers.
   - Likewise, we have thus far been using a dotted-notation convention to
   address specific attributes. This convention or a similar one can be
   used by the relationship module (ex: "os.version",
   "last_modified.by.user.id"), as long as the 'keys' themselves don't
   contain a period ;)
   - Every relationship will be kept 'in memory', in much the same way as
   validate doc update functions are kept 'in memory' and used for every
   write.
   - The regular Doc PUT/POST API will fail when an attribute that is
   involved in a relationship is changed on a document of a classifiable
   'type'.
   - To modify an attribute that is involved in a relationship, a
   "transactional update" API must be used. All the related documents for
   those change(s) must also be submitted through this "bulk_docs"-like
   API (perhaps bulk_docs itself?).
   - The idea is that a client initiating the transactional update will
   fetch all related documents through a helper API which "denormalizes"
   them and returns them as one larger hash.
   - The transactional update will also reference the defined relationships
   and follow a 3PC protocol (where an extra metadata field in the document
   is used to keep the state of the ongoing "transaction") to tolerate
   potential failures during other concurrent transactional updates.
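
The last three points could be sketched roughly as follows. This is a
single-process simulation against an in-memory object, not a working
protocol: it has no timeouts or recovery, which are the hard part of 3PC.
The names store, fetchWithRelated, transactionalUpdate and the txn
metadata field are all hypothetical, and _rev is a plain integer here
whereas real CouchDB revisions are strings.

```javascript
// Hypothetical in-memory stand-in for the database: id -> document.
var store = {
  p0: { _id: "p0", _rev: 1, kid_ids: ["p1"] },
  p1: { _id: "p1", _rev: 1, father_id: "p0" }
};

function clone(doc) {
  return JSON.parse(JSON.stringify(doc));
}

// Helper in the spirit of the "denormalizing" API: return copies of a
// document and of every document it references through the given keys.
function fetchWithRelated(id, refKeys) {
  var doc = store[id];
  var result = {};
  result[id] = clone(doc);
  refKeys.forEach(function (key) {
    [].concat(doc[key] || []).forEach(function (refId) {
      if (store[refId]) { result[refId] = clone(store[refId]); }
    });
  });
  return result;
}

// Apply all updates or none. Each update carries the _rev the client read.
function transactionalUpdate(txnId, updates) {
  // Phase 1 (can-commit): every document must be unchanged since it was
  // read and must not belong to another in-flight transaction.
  var ok = updates.every(function (u) {
    var current = store[u._id];
    return current !== undefined && current._rev === u._rev &&
      current.txn === undefined;
  });
  if (!ok) { return { ok: false, reason: "conflict" }; }

  // Phase 2 (pre-commit): record the transaction state on each document.
  updates.forEach(function (u) {
    store[u._id].txn = { id: txnId, state: "precommit" };
  });

  // Phase 3 (do-commit): write the new content, bump the revision and
  // drop the marker.
  updates.forEach(function (u) {
    var next = Object.assign({}, u, { _rev: u._rev + 1 });
    delete next.txn;
    store[u._id] = next;
  });
  return { ok: true };
}

// Usage: detach p1 from its father, updating both sides together.
var snapshot = fetchWithRelated("p1", ["father_id"]);
snapshot.p1.father_id = null;
snapshot.p0.kid_ids = [];
var first = transactionalUpdate("t1", [snapshot.p1, snapshot.p0]);
// Replaying the same, now stale, snapshot is rejected in phase 1.
var second = transactionalUpdate("t2", [snapshot.p1, snapshot.p0]);
console.log(first.ok, second.ok);
```

In a real system a crash between phases 2 and 3 would leave the txn marker
behind on some documents; that leftover marker is what a recovery step
would inspect to decide whether to finish or roll back the transaction.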

Thus, a design document that describes a relationship would look something
like:

{
  "ClassifiedTypePerson": {
    "classifier": function (doc) {
      if (doc.blah && doc.blah2) {
        return true;
      }
      return false;
    },
    "relationships": [
      { "from": "my.attribute.to.reference.daddy",
        "to": "ClassifiedTypeDaddy", "type": "1:1" },
      { "from": "my.other.attribute.to.reference.kids",
        "to": "ClassifiedTypeChildren", "type": "1:*" }
    ]
  }
}
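
As a rough illustration of how an engine might read such a definition,
the sketch below lists the outgoing references of one document. It
assumes, and this is only an assumption, that each "from" attribute holds
a document id (or an array of ids for "1:*"). The attribute names
father_id and family.kid_ids and the function outgoingReferences are
hypothetical.

```javascript
// Hypothetical in-memory form of a relationship definition.
var definitions = {
  ClassifiedTypePerson: {
    classifier: function (doc) {
      return Boolean(doc.name && doc.father_id);
    },
    relationships: [
      { from: "father_id", to: "ClassifiedTypeDaddy", type: "1:1" },
      { from: "family.kid_ids", to: "ClassifiedTypeChildren", type: "1:*" }
    ]
  }
};

// Resolve a dotted path; undefined when any segment is missing.
function getPath(doc, path) {
  return path.split(".").reduce(function (value, key) {
    return (value !== null && typeof value === "object") ? value[key] : undefined;
  }, doc);
}

// For every type the document classifies as, collect the ids it points at
// and the relationship through which it points at them.
function outgoingReferences(doc) {
  var refs = [];
  Object.keys(definitions).forEach(function (typeName) {
    var def = definitions[typeName];
    if (def.classifier(doc) !== true) { return; }
    def.relationships.forEach(function (rel) {
      var value = getPath(doc, rel.from);
      if (value === undefined) { return; }
      // Assumption: a "1:*" attribute holds an array of ids, others one id.
      var ids = Array.isArray(value) ? value : [value];
      ids.forEach(function (id) {
        refs.push({ from: rel.from, toType: rel.to, cardinality: rel.type, id: id });
      });
    });
  });
  return refs;
}

var person = {
  _id: "p1", name: "Ann", father_id: "p0",
  family: { kid_ids: ["p2", "p3"] }
};
console.log(outgoingReferences(person));
```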

This is just a sugary way of defining some commonly recurring
auto-validation rules which invariably reference / depend on other
documents and it is not without compromises.
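
For comparison, one such hand-written rule might look like the sketch
below. It follows the calling convention of a CouchDB validate_doc_update
function (newDoc, oldDoc, with oldDoc being null on creation) and rejects
the write by throwing a forbidden object. The guardedPaths list and the
function name are hypothetical, and allowing creation unchecked is an
assumption of this sketch.

```javascript
// Hypothetical list of attribute paths that participate in relationships.
var guardedPaths = ["father_id", "family.kid_ids"];

// Resolve a dotted path; undefined when any segment is missing.
function getPath(doc, path) {
  return path.split(".").reduce(function (value, key) {
    return (value !== null && typeof value === "object") ? value[key] : undefined;
  }, doc);
}

// Reject a regular write that changes any guarded attribute.
function rejectRelationshipChanges(newDoc, oldDoc) {
  if (!oldDoc) { return; } // creation is allowed here (an assumption)
  guardedPaths.forEach(function (path) {
    // Comparing JSON text is enough for ids and arrays of ids; it would be
    // sensitive to key order if the attribute held an object.
    var before = JSON.stringify(getPath(oldDoc, path));
    var after = JSON.stringify(getPath(newDoc, path));
    if (before !== after) {
      throw { forbidden: "'" + path + "' is part of a relationship; " +
        "use the transactional update API" };
    }
  });
}

var stored = { _id: "p1", name: "Ann", father_id: "p0" };

// Changing an unrelated attribute passes.
rejectRelationshipChanges({ _id: "p1", name: "Anne", father_id: "p0" }, stored);

// Changing a guarded attribute is refused.
var refusal = null;
try {
  rejectRelationshipChanges({ _id: "p1", name: "Ann", father_id: "p9" }, stored);
} catch (e) {
  refusal = e;
}
console.log(refusal.forbidden);
```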

The compromises are:
- one-shard-forever compromise: since this is about infrastructure, the
size of the data-set will fit under 2-4 GB. So even if the entire DB has to
be read by Couch, we don't care. This way, all "related" documents will be
found on the same disk. Unless, we formalize distributed
- unpredictable write times compromise: every write will involve a
predictable number of reads and a predictable failure for those attributes
which are defined under a 'relationship' (attributes with relationships can
be modified only through a separate 'special' API where all the related
documents

What do you think about this? Would people here find use for this in their
day-to-day needs? Would the couchdb-devs merge this into mainline CouchDB
if such a patch were submitted?

Regards,

  -Suraj

-- 
An Onion is the Onion skin and the Onion under the skin until the Onion
Skin without any Onion underneath.
