incubator-couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Antony Blakey <antony.bla...@gmail.com>
Subject Architectural advice requested, and some 1.0 suggestions.
Date Tue, 04 Nov 2008 23:47:50 GMT
Hi, [ I'm posting to both -user and -dev because IMO it's relevant to  
both ]

I'm in the middle of a trial port of a system from postgresql to  
couchdb. The system has a distribution/replication requirement due  
primarily to unreliable connectivity and limited end-to-end bandwidth  
(and high latency) over a geographically dispersed set of users.

I have two requirements that don't fit well with the basic couchdb  
model. I've developed solutions for which I'd appreciate some  
feedback. Also it's a request for the notification mechanism to be  
improved, something like _external to moved to the core, and/or an in- 
process erlang plugin model that could replace notification/_external- 
based techniques.

My solution is a generalization of the GeoCouch technique: http://vmx.cx/cgi-bin/blog/index.cgi/geocouch-geospatial-queries-with-couchdb

, and it allows me to mix SQL and couchdb access, all with the couchdb  
model and without breaking replication.

Requirement 1: I have a User/Role/Permission model for security. Users  
have roles and roles have permissions, and the primary (and very  
common) operation is to check for a transitive (user, permission)  
pair, which involves a join. It would be wrong to invert the user-role  
relationship and put both a list of users and a list of permissions  
into the role document because it is the user that joins roles and the  
maintenance of the user-role relationship is done when editing the user.

Requirement 2: I have a metadata search mechanism that allows the user  
to construct arbitrary queries (in the problem domain, not SQL). The  
transformation of the model from relational to document-based has  
eliminated any joins, but the problem isn't amenable to the oft- 
described key-based virtual joins, and I need to avoid retrieving  
potentially large sets of results and doing set operations in the  
client because it doesn't scale.

My solution is to use the notification mechanism to maintain  
(multiple) SQLite databases containing the document keys I need to  
search over. In the URP example, I store (db, src, dest) records. I  
also store the last seqno, so that I can do incremental updates to  
SQLite.

Then, using _external, I can make queries of SQLite to get me either  
(user, permission) pairs using a self-join, or in the case of the  
arbitrary metadata queries, a list of document ids.

The primary difficulties I have with this simple model are:

1) The notification mechanism doesn't give me enough information.  
Currently I have to do a _all_docs_by_seq and check for deletions by  
attempting to get each document, which I have to do for every document  
in any case (unless I use transparent id's) to determine if I'm  
interested in it, and then get the data. I presume this technique  
works because deletion is actually a form of update until compaction  
(I copied it from GeoCouch).

** SUGGESTION ** I would prefer an update interface that gave me a  
stream of (old, new) document pairs, which covers add/update/delete,  
plus a (from, to) seqno pair. Have a 'from' seqno lets me know when I  
have to trigger a full rescan, which I need to do in a variety of  
circumstances such as configuration change.

2) The process is not as performant as it could be because of the JSON  
marshalling.

** SUGGESTION ** I would prefer to write the notification process in  
erlang, and be passed the internal representation of the pre/post  
documents to avoid marshalling. Same for map/reduce. I know I can  
modify couchdb explicitly to do this, but that commits me to git  
tracking/merging etc. A in-process plugin option would be better.

3) Allowing the notification and _external handlers to communicate is  
more work than it should be. I want to do this so that the _external  
handler can cache the queries and invalidate the cache in response to  
updates.

As an alternative, for the URP data, I could use memcached and let the  
notification process invalidate the cache, which would mean that the  
_external handler doesn't have to cache at all because the client  
would only hit it on a memcached miss. The ultimate extension of this  
would be to have no _external handler, and have the notification  
process not use SQLite at all and instead let the client maintain the  
data in memcached using normal couchdb requests to calculate the  
transitive closure, which would be invalidated by the notification  
process. Unless the client passivates this data to disk it would mean  
recreating it (on-demand, incrementally) each time the client process  
starts. This is a difficult problem with multiple clients (from a  
development coordination perspective), so maybe an _external process  
is still required, even if it doesn't use on-disk passivation.

That solution doesn't work for the generalized metadata queries, but  
that's not a problem.

** SUGGESTION ** I think the _external mechanism (or something better)  
should be integrated as a matter of priority. IMO this is a key  
mechanism for augmenting the map/reduce model to satisfy more use-cases.

** SUGGESTION ** A plugin mechanism should allow a plugin to both  
listen to notifications and respond to e.g. _external.

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Human beings, who are almost unique in having the ability to learn  
from the experience of others, are also remarkable for their apparent  
disinclination to do so.
   -- Douglas Adams



Mime
View raw message