Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
Received-SPF: pass (nike.apache.org: domain of suraj.kumar@inmobi.com
 designates 209.85.213.170 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CABadZm+67XC9BKspBSUrJqHuaTi3q3yLCht74qinO9Nsew9BYg@mail.gmail.com>
References: 
 <CABadZm+67XC9BKspBSUrJqHuaTi3q3yLCht74qinO9Nsew9BYg@mail.gmail.com>
Date: Wed, 16 Apr 2014 09:29:12 +0530
Message-ID: 
 <CABadZmLWd02SRbfvdRYkWFeqW2Am8Sk-Rj+FD+JbBSSoiBhXoQ@mail.gmail.com>
Subject: Re: Modeling Relationships and providing Transactional Integrity
From: Suraj Kumar <suraj.kumar@inmobi.com>
To: dev@couchdb.apache.org, user@couchdb.apache.org
Content-Type: multipart/alternative; boundary=047d7bd911d48195ff04f720f056

--047d7bd911d48195ff04f720f056
Content-Type: text/plain; charset=UTF-8

Hello,

On second reading, I see that my original post had a lot of typos and
half-finished sentences. Leaving those aside, has anybody managed to read
it through and give it some thought? Any questions or clarifications?
We'd really like to get started the right way, so your advice would be
very helpful.

Thanks,

  -Suraj


On Thu, Apr 10, 2014 at 6:24 PM, Suraj Kumar <suraj.kumar@inmobi.com> wrote:

> [warning: cross-posted]
>
> Hi,
>
> We're attempting to build a model of a large-scale, complex
> infrastructure. That means every machine and its supporting machines
> report to a mothership. Since our problem is truly one of high concurrency,
> choosing a solid database to keep the state of this model became our focus
> early on. We zeroed in on CouchDB, mainly because Erlang powers it, which
> lets us build the things we need that CouchDB itself doesn't provide. One
> of those things is the notion of Relationships.
>
> What do I mean by "Relationships" really? Some "types" of Entities have
> attributes which may be related to other "types" of Entities in specific
> known ways (1:1, 1:*, *:1).
>
> The "Type" becomes the hazy part for schemaless systems like CouchDB.
> However, let us now talk in Couch primitives.
>
> Let us set aside the question of how this could still result in
> inconsistency in a live distributed database... and imagine that there
> could be 'design' documents that describe how some attributes of some
> "types" of documents are related to some attributes of other "types" of
> documents. Imagine that a new 'Relationships' engine could use these to
> automatically validate and maintain the relational integrity of the
> database. In Couch terminology, it is a way to automatically modify
> certain keys of related documents whenever certain keys of a document of a
> given 'type' change.
>
> I'm now attempting to formally describe two of the basic primitive
> elements of every practical schemaless database system, specifically
> CouchDB:
>
> 1. Documents of classifiable 'types' or 'sets'.
> 2. Attributes (*keys of the JSON hash*), and a way to address attributes
> using a generic, intuitive and standard "*convention*".
>
> I am of the belief that defining these two formally is the first step to
> approach implementing Relationships in CouchDB as a usable general purpose
> optional feature (for those who are willing to compromise some things in
> return :) ).
>
> Some more thoughts:
>
>
>    - "types" in a schemaless JSON data structure can only be determined
>    by a function. Hence, there should be 'type'-determining functions, or
>    classifiers.
>    - Likewise, we have thus far been using a dotted-notation convention
>    to address specific attributes. This convention, or a similar one, can
>    be used by the relationship module (ex: "os.version",
>    "last_modified.by.user.id"), as long as the 'keys' themselves don't
>    contain a period ;)
>    - every relationship will be kept 'in memory', in much the same way as
>    how validate doc update functions are kept 'in memory' and used for every
>    write.
>    - the regular Doc PUT/POST API will fail when it changes an attribute
>    that is involved in a relationship, on a document of a classifiable
>    'type'.
>    - To modify an attribute that is involved in a relationship, a
>    "transactional update" API must be used. All the documents related to
>    those change(s) must also be submitted through this "bulk_docs"-like
>    API (perhaps bulk_docs itself?).
>    - The idea is that a client initiating a transactional update will
>    first fetch all related documents through a helper API, which
>    "denormalizes" them and returns them as one larger hash.
>    - The transactional update will also reference the defined
>    relationships and follow a 3PC protocol (where an extra metadata field
>    in the document keeps the state of the ongoing "transaction") to
>    tolerate potential failures during other concurrent transactional
>    updates.
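
To make the dotted-notation addressing in the list above concrete, here is a
minimal sketch. It assumes documents are plain nested JSON objects and that
no key contains a period. The names getAttribute and changedRelationshipPaths
are hypothetical helpers for illustration, not CouchDB APIs, and the sample
documents are made up.

```javascript
// Hypothetical helper: resolve a dotted path such as "os.version" against a
// document. Returns undefined when any segment along the path is missing.
function getAttribute(doc, path) {
  return path.split(".").reduce(function (node, key) {
    return (node !== null && typeof node === "object") ? node[key] : undefined;
  }, doc);
}

// Hypothetical helper: list the relationship-bound paths whose values differ
// between the stored document and the proposed update. A plain PUT/POST
// could be rejected whenever this list is non-empty.
function changedRelationshipPaths(oldDoc, newDoc, paths) {
  return paths.filter(function (path) {
    return JSON.stringify(getAttribute(oldDoc, path)) !==
           JSON.stringify(getAttribute(newDoc, path));
  });
}

// Made-up sample documents, for illustration only.
var oldDoc = { os: { version: "12.04" }, last_modified: { by: { user: { id: "u1" } } } };
var newDoc = { os: { version: "14.04" }, last_modified: { by: { user: { id: "u1" } } } };

console.log(changedRelationshipPaths(
  oldDoc, newDoc, ["os.version", "last_modified.by.user.id"]
)); // prints [ 'os.version' ]
```

Comparing through JSON.stringify is a shortcut that is sensitive to key
order in nested objects; a real implementation would want a proper deep
comparison.
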
>
> Thus, a design document that describes a relationship would look something
> like:
>
> {
>   "ClassifiedTypePerson": {
>     "classifier": function (doc) {
>       // a document is of this type when both marker keys are present
>       if (doc.blah && doc.blah2) {
>         return true;
>       }
>       return false;
>     },
>     "relationships": [
>       { "from": "my.attribute.to.reference.daddy",
>         "to": "ClassifiedTypeDaddy", "type": "1:1" },
>       { "from": "my.other.attribute.to.reference.kids",
>         "to": "ClassifiedTypeChildren", "type": "1:*" }
>     ]
>   }
> }
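
As a rough sketch of how an engine might consume such a definition: the code
below keeps the relationship definitions in an in-memory object, classifies a
document, and collects the ids it references. Everything here is hypothetical
(registry, classify, relatedReferences, the sample document), and it assumes
that a relationship attribute stores the _id of the related document, or an
array of _ids for "1:*". None of this is an existing CouchDB API.

```javascript
// Hypothetical in-memory registry with the same shape as the design document.
var registry = {
  ClassifiedTypePerson: {
    classifier: function (doc) { return Boolean(doc.blah && doc.blah2); },
    relationships: [
      { from: "my.attribute.to.reference.daddy", to: "ClassifiedTypeDaddy", type: "1:1" },
      { from: "my.other.attribute.to.reference.kids", to: "ClassifiedTypeChildren", type: "1:*" }
    ]
  }
};

// Resolve a dotted path; undefined when any segment is missing.
function getAttribute(doc, path) {
  return path.split(".").reduce(function (node, key) {
    return (node !== null && typeof node === "object") ? node[key] : undefined;
  }, doc);
}

// Names of every registered type whose classifier accepts the document.
function classify(registry, doc) {
  return Object.keys(registry).filter(function (name) {
    return Boolean(registry[name].classifier(doc));
  });
}

// For each relationship of each matching type, collect the referenced ids
// (assumption: the attribute holds one id, or an array of ids for "1:*").
function relatedReferences(registry, doc) {
  var refs = [];
  classify(registry, doc).forEach(function (typeName) {
    registry[typeName].relationships.forEach(function (rel) {
      var value = getAttribute(doc, rel.from);
      if (value === undefined) { return; }
      refs.push({ to: rel.to, type: rel.type, ids: Array.isArray(value) ? value : [value] });
    });
  });
  return refs;
}

// Made-up sample document, for illustration only.
var person = {
  blah: 1, blah2: 1,
  my: {
    attribute: { to: { reference: { daddy: "daddy-1" } } },
    other: { attribute: { to: { reference: { kids: ["kid-1", "kid-2"] } } } }
  }
};

console.log(classify(registry, person));
console.log(relatedReferences(registry, person));
```

The ids returned by relatedReferences are what a client would need to fetch
and resubmit together in the "transactional update" described earlier.
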
>
> This is just a sugary way of defining some commonly recurring
> auto-validation rules which invariably reference / depend on other
> documents and it is not without compromises.
>
> The compromises are:
> - one-shard-forever compromise: since this is about infrastructure, the
> size of the data set will fit under 2-4 GB. So even if the entire DB has to
> be read by Couch, we don't care. This way, all "related" documents will be
> found on the same disk. Unless, we formalize distributed
> - unpredictable write times compromise: every write will involve a
> predictable number of reads, and predictable failure for those attributes
> which are defined under a 'relationship' (attributes with relationships can
> be modified only through a separate 'special' API where all the related
> documents
>
> What do you think about this? Would people here find this useful for
> their day-to-day needs? Would the couchdb-devs merge this into mainline
> CouchDB if such a patch were submitted?
>
> Regards,
>
>   -Suraj
>
> --
> An Onion is the Onion skin and the Onion under the skin until the Onion
> Skin without any Onion underneath.
>
>



--047d7bd911d48195ff04f720f056--