couchdb-dev mailing list archives

From Jan Lehnardt <...@apache.org>
Subject Re: First Demo/Draft of _access / per document permissions
Date Thu, 16 Nov 2017 09:00:11 GMT
Thanks Adam,

we talked about limiting the number of roles a user could have (to 10, I think) to keep
the multi-query complexity at bay. I think we also talked about just keeping the individual
segment update-seqs around, but we didn’t speak about the size/complexity of the combined
seq-id, if I recall correctly.

If 2^n-1 when n <= 10 is acceptable for seq-id size, we’re on track. If not, that’s
a TBD.
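To put a number on that bound, here is a minimal sketch that enumerates the non-empty role combinations Adam mentions below (2^n-1 for a user with n roles). It assumes the worst case where every combination contributes one entry to the merged seq-id; `roleCombinations` is a hypothetical helper for illustration, not CouchDB code.

```javascript
// Enumerate every non-empty subset of a user's roles using a bitmask.
// Hypothetical helper for illustration only; not part of CouchDB.
function roleCombinations(roles) {
  const combos = [];
  for (let mask = 1; mask < (1 << roles.length); mask++) {
    combos.push(roles.filter((_, i) => mask & (1 << i)));
  }
  return combos;
}

// Worst case for the proposed limit of 10 roles per user.
const worstCase = roleCombinations(
  Array.from({ length: 10 }, (_, i) => `role-${i}`)
).length;
console.log(worstCase); // 2^10 - 1 = 1023
```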

My notes say I wanna tackle roles next, but given your feedback, I think I’ll try and get
the username-only version of by-access-id and by-access-seq working first. That’s a good
enough milestone to see through, and maybe even ship, before diving into roles right away.

That said, there’s quite a bit of work left, so I’m not in a hurry figuring out roles
and adding that pre-shipping.

* * *

I’ve updated the gist, and _all_docs now has an un-munged key member that’s just the
doc-id.

The next iteration of this will live in a branch & PR, so we can discuss details there.

Best
Jan
--


> On 16. Nov 2017, at 04:26, Adam Kocoloski <kocolosk@apache.org> wrote:
> 
> Hi Jan,
> 
> I took a closer read and I do think you’re on the right path. I certainly agree with
reusing the secondary index machinery to create the extra internal indexes.
> 
> On the by-access-seq index … did we ever discuss how to efficiently track and report
the last observed sequences from the various ranges of the index to which a user has access?
I suppose the single seq from each contributing shard could change to an array of seqs, one
from each range. I do worry about the size of the merged sequence (I’m remembering the 2^n-1
possible role combinations granting access for a user possessing n roles). I didn’t see
anything in the summit notes.
> 
> Adam
> 
>> On Nov 15, 2017, at 4:35 PM, Jan Lehnardt <jan@apache.org> wrote:
>> 
>> Hi all,
>> 
>> in the midst of handling the security stuff I had a moment of clarity about how the often
requested per document permissions could be implemented. We had discussed a potential approach
extensively at the February Boston Developer Summit (notes here: https://lists.apache.org/thread.html/09a5686bca8049010b82796cc0fe99ef27aed4983a3f02fd6956259f@%3Cdev.couchdb.apache.org%3E).
>> 
>> What was so alluring about this proposal was that it solves per doc access control
and per-user-db in one go. E.g. it would allow multiple mutually distrusting users to share a
single database, have their own set of views, and even independently use their
share of that database as a replication endpoint without interfering with any of the other
users on that database.
>> 
>> I gave it a shot. Essentially, we need to build new indexes: by-access-id and by-access-seq
to make all that work. I’m just focussing on the core of this, trying to re-use the existing
couch_mrview/couch_index machinery as much as possible. Strictly, for replication only by-access-seq
would be required, but by-access-id is a little easier to do, so I’ve done that first, and
I believe the results are encouraging.
>> 
>> I’ve put a diff against master into a gist for your perusal:
>> 
>> https://gist.github.com/janl/20b218a3f0eafbf963ee28780261f9fc
>> 
>> 
>> The core bits are:
>> 
>> https://gist.github.com/janl/20b218a3f0eafbf963ee28780261f9fc#file-by-access-id-diff-L189-L215
>> 
>> and
>> 
>> https://gist.github.com/janl/20b218a3f0eafbf963ee28780261f9fc#file-by-access-id-diff-L189-L215
>> 
>> Here’s an example Doc:
>> 
>> {
>> "_id":"1fb94bf8c3d5a73745f3cc4f5f000a8d",
>> "_rev":"4-bcbc975e61bdb80f3de1b87f6cad6a76",
>> "_access":["b"]
>> }
>> 
>> It shows up for user b:
>> 
>> 
>> curl b:b@127.0.0.1:15984/a/_all_docs
>> 
>> {"total_rows":2,"offset":0,"rows":[
>> {"id":"1fb94bf8c3d5a73745f3cc4f5f000a8d","key":["b","1fb94bf8c3d5a73745f3cc4f5f000a8d"],"value":"4-bcbc975e61bdb80f3de1b87f6cad6a76"}
>> ]}
>> 
>> But not for user c:
>> 
>> 
>> curl c:c@127.0.0.1:15984/a/_all_docs
>> 
>> {"total_rows":2,"offset":2,"rows":[
>> 
>> ]}
>> 
>> 
>> * * *
>> 
>> 
>> I’d like to get some general design feedback on this approach to find out if it
is worth pursuing further. See “Next Steps” way below for my thinking on how to get by-access-seq
going.
>> 
>> The rest of this email is my notes from reading the source: I try to explain
my thinking and to help folks who might not be very familiar with the CouchDB sources
follow along with what is happening.
>> 
>> I’d especially like to get some feedback about this from some of the folks here
who don’t spend their days in the main Erlang codebase :)
>> 
>> Let me know what you think.
>> 
>> Thanks!
>> Jan
>> 
>> * * *
>> 
>> CouchDB Access Notes
>> 
>> Background: https://lists.apache.org/thread.html/09a5686bca8049010b82796cc0fe99ef27aed4983a3f02fd6956259f@%3Cdev.couchdb.apache.org%3E
>> 
>> # Overview
>> 
>> To solve the problems with the db-per-user pattern, we want to introduce document
level access control. The result should be a single CouchDB database that can be used by multiple
mutually untrusting users while retaining CouchDB’s full semantics.
>> 
>> // TODO: link to appendix: problems with db-per-user
>> 
>> We decided on an approach to define access control in documents with a new property
`_access` which is specified as an array of strings and arrays. Strings represent usernames
and roles, sub-arrays are used as logical AND, elements in the top level array are used as
logical OR. For example, an _access field with the value [["management", "senior"],
"ceo-jane"] would give access to that doc to everyone with the roles "management" AND "senior", OR
to the user "ceo-jane", but not to e.g. users with the roles "development" and
"senior", nor to the user "vp-jenn".
>> 
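>> The OR/AND rule above can be sketched as a small predicate. This is only an illustration under a few assumptions, not the code in the gist: `hasAccess` is a hypothetical helper, userCtx is assumed to have the shape { name, roles }, usernames and roles are treated as one namespace, and a missing or empty _access grants nothing to non-admins.
>>
```javascript
// Evaluate an _access value against an authenticated user context.
// Top-level entries are OR-ed; a nested array is an AND over its members.
// hasAccess and the { name, roles } shape are assumptions of this sketch.
function hasAccess(access, userCtx) {
  if (!Array.isArray(access)) return false;
  const held = new Set([userCtx.name, ...userCtx.roles]);
  return access.some((entry) =>
    Array.isArray(entry)
      ? entry.length > 0 && entry.every((name) => held.has(name))
      : held.has(entry)
  );
}

const access = [["management", "senior"], "ceo-jane"];
console.log(hasAccess(access, { name: "ceo-jane", roles: [] })); // true
console.log(hasAccess(access, { name: "bob", roles: ["management", "senior"] })); // true
console.log(hasAccess(access, { name: "vp-jenn", roles: ["development", "senior"] })); // false
```
>>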
>> To retain core CouchDB semantics, we need to introduce new behaviour for the _all_docs
and _changes endpoints. The plan is to special-case this based on the authenticated user context
(userCtx, i.e. username and associated roles, after authentication).
>> 
>> The existing by-id and by-seq indexes are not equipped to efficiently return results
per user, so we are introducing two new indexes (either can be optionally configured, depending
on the use-case and performance and storage needs): by-access-id and by-access-seq. In contrast
with by-id and by-seq, these indexes are not stored in the main database file, but in a separate
file, ideally managed by the existing couch_index infrastructure.
>> 
>> 
>> # Development considerations
>> 
>> This first spike is only concerned with getting by-access-id to work with minimal
effort.
>> 
>> To get started, let’s look at how _all_docs works today using the by-id index.
>> 
>> ## The Anatomy of a Clustered _all_docs Request
>> 
>> CouchDB’s clustering layer consists of three main modules: chttpd, fabric and rexi.
chttpd’s job is to handle everything HTTP and route requests to the right place in the rest
of the code. It’s an HTTP router, mapping URLs, request methods and options to handler functions
that do the work the requests ask for.
>> 
>> fabric’s job is to distribute a single request from the outside to multiple nodes
of the cluster. Some requests require only talking to the local node, but that’s less important
for the moment. fabric includes fabric_rpc, a module that turns a request to the cluster into
one or more requests to other nodes in the cluster.
>> 
>> rexi’s job is to know about the cluster state: which nodes are in the cluster, which
of them are active/reachable/failed, which shards live on which nodes. fabric uses rexi to
know which nodes to contact for which shards.
>> 
>> After a bit of indirection, we find ourselves at the first _all_docs-specific function
in chttpd_db.erl: all_docs_view/4:
>> 
>> ```
>> all_docs_view(Req, Db, Keys, OP) ->
>>   Args0 = couch_mrview_http:parse_params(Req, Keys),
>>   Args1 = Args0#mrargs{view_type=map},
>>   Args2 = couch_mrview_util:validate_args(Args1),
>>   Args3 = set_namespace(OP, Args2),
>>   Options = [{user_ctx, Req#httpd.user_ctx}],
>>   Max = chttpd:chunked_response_buffer_size(),
>>   VAcc = #vacc{db=Db, req=Req, threshold=Max},
>>   {ok, Resp} = fabric:all_docs(Db, Options, fun couch_mrview_http:view_cb/2, VAcc, Args3),
>>   {ok, Resp#vacc.resp}.
>> ```
>> 
>> The first five lines handle query options and request parameters or arguments. The
next three lines are the bulk of the job: start a response, call fabric:all_docs/5 with a
callback to handle rows. The last line returns the accumulator that is returned by fabric:all_docs/5.
>> 
>> fabric:all_docs/5 is a thin wrapper around fabric_view_all_docs:go/5. Before we jump
down, we notice that there is also a fabric_view_changes.erl, which we should remember for
the next iteration when we implement by-access-seq.
>> 
>> go/5 comes in two variants and we’ll ignore the second here for the moment, because
it is a performance optimisation. The main work for go/5 is in the top third of the function.
First it gets all shards for the current database from mem3, then it starts a fabric_rpc worker
for each shard, and then waits for the results to come back by calling go/6 with all workers.
The bottom two thirds are timeout and error handling.
>> 
>> go/6 registers the handle_message/3 function as the callback for rexi_utils’ recv/6
(read “receive”) function.
>> 
>> handle_message/3 comes in a number of variants to handle rexi errors, receiving metadata,
receiving result rows and a notification “complete” about all rows having been sent.
>> 
>> Our next level down is looking into fabric_rpc and how it handles all_docs requests.
fabric_rpc/3 is again a short wrapper, this time around couch_mrview:query_all_docs/4 which
is the node-local function that handles querying.
>> 
>> couch_mrview includes a bunch of functions for map/reduce views. It seems like a natural
place to make our distinction between a normal by-id request and a by-access-id request.
>> 
>> I’m skipping a step here, but with a little printf-debugging, I’ve found out
that the `Db` variable we get passed in includes the authenticated userCtx, including username
and any roles. We can use couch_db:is_admin/1 to get a boolean back for the distinction we
are going to have to make:
>> 
>> ```
>> query_all_docs(Db, Args0, Callback, Acc) ->
>>   case couch_db:is_admin(Db) of
>>       true -> query_all_docs_admin(Db, Args0, Callback, Acc);
>>       false -> query_all_docs_access(Db, Args0, Callback, Acc)
>>   end.
>> ```
>> 
>> query_all_docs_admin/4 is the existing query_all_docs/4 function and we’re introducing
query_all_docs_access/4, that we now have to fill out with querying our view.
>> 
>> Before we can do that, we need to understand how views work.
>> 
>> 
>> ## The Anatomy of a View Request
>> 
>> Querying a view has three stages:
>> 
>> 1. define the view
>> 2. build the view index
>> 3. query the view index
>> 
>> A view definition is always in a design document. It can be one of: JavaScript map/reduce
functions, Erlang map/reduce functions, or a mango index definition.
>> 
>> // TODO: link all these view definition options.
>> 
>> Building the view index is an implicit step in CouchDB. View indexes are refreshed
at query time, but only if there were any changes in the database since the last query. If
no refresh is needed, the view result is returned from the index directly.
>> 
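>> The refresh-on-read behaviour can be illustrated with a toy model. This is a sketch only: `makeViewIndex`, `queryView`, `db.updateSeq` and `db.changes` are hypothetical names, and the real couch_index updater is considerably more involved.
>>
```javascript
// Toy model of "refresh on read": the index remembers the last database
// sequence it has processed and only re-indexes newer changes.
// All names here (db.updateSeq, db.changes, mapFn) are hypothetical.
function makeViewIndex(mapFn) {
  return { mapFn, lastSeq: 0, rows: [], refreshes: 0 };
}

function queryView(index, db) {
  if (db.updateSeq > index.lastSeq) {
    for (const { seq, doc } of db.changes) {
      if (seq <= index.lastSeq) continue;
      // Drop rows from an older revision of this doc, then re-map it.
      index.rows = index.rows.filter((row) => row.id !== doc._id);
      index.mapFn(doc, (key, value) => index.rows.push({ id: doc._id, key, value }));
    }
    index.lastSeq = db.updateSeq;
    index.refreshes++;
  }
  return index.rows;
}

const db = { updateSeq: 1, changes: [{ seq: 1, doc: { _id: "a", n: 1 } }] };
const index = makeViewIndex((doc, emit) => emit(doc.n, null));
queryView(index, db); // first query builds the index
queryView(index, db); // no changes since: served straight from the index
console.log(index.refreshes); // 1
```
>>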
>> // TODO: explain query_server
>> 
>> Querying indexes follows a similar path through chttpd, fabric, rexi, fabric_rpc
down to the per-node handlers in couch_mrview. Just a few lines below couch_mrview:query_all_docs/4
we find query_view/5 which decides between map and reduce requests. We care about map-only
for now. query_view/5 is preceded by query_view/6 which includes a call to couch_mrview_util:get_view/4
which looks like it is where we want to look next, as the map_fold/5 called by query_view/5
is about looping over rows. We hope we can re-use all that logic, and maybe get_view/4 lets
us find out how we can have it return our new view.
>> 
>> get_view/4 calls get_view_index_state/4 which in turn calls get_view_index_pid/4
that finally calls into couch_index_server:get_index/4 which looks like it returns the index
for our request. Let’s have a look.
>> 
>> get_index/4 will dive into get_index/2 eventually and that looks indeed like where
we need to look. In there, we look up the view index in an ETS table (an in-memory database),
and if we can’t find it there, we start a new one. Either way, a view index is returned. The
lookup is by DbName and Sig(nature), an md5 hash over the `views` property in a design doc,
that also corresponds to the *.view filename of the view index.
>> 
>> 
>> ## Faking the index
>> 
>> So how would we get this to return the index we want to query? We need to create
an index definition that matches the design doc `views` hash. Hm.
>> 
>> It is relatively easy to produce a map function that behaves like we want:
>> 
>> function (doc) {
>>   var _access = doc._access
>>   if (!_access) { return }
>>   if (!isArray(_access) || _access.length === 0) { return }
>>   _access.forEach(function (user_or_role) {
>>     emit([user_or_role, doc._id], doc._rev)
>>   })
>> }
>> 
>> At query time, we’d have to match the requesting username and roles against the
first element in the key-array and return the results, while replacing the key-array with
the second element (the doc _id). All this doesn’t sound too hard. Good.
>> 
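>> The query-time step described above (match the key prefix, then un-munge the key) might look roughly like this. `rowsFor` and the sample rows are hypothetical, and the sketch scans an in-memory array where a real implementation would presumably use a key range on the index.
>>
```javascript
// Rows as the map function above would emit them: key = [user_or_role, doc_id].
const indexRows = [
  { id: "doc1", key: ["a", "doc1"], value: "1-abc" },
  { id: "doc2", key: ["b", "doc2"], value: "4-bcb" },
  { id: "doc3", key: ["b", "doc3"], value: "2-def" },
];

// Keep rows whose key prefix is one of the caller's names, then replace the
// [prefix, doc_id] key with the bare doc id. Hypothetical helper.
function rowsFor(names, rows) {
  const allowed = new Set(names);
  return rows
    .filter((row) => allowed.has(row.key[0]))
    .map((row) => ({ ...row, key: row.key[1] }));
}

console.log(rowsFor(["b"], indexRows).map((row) => row.key)); // [ 'doc2', 'doc3' ]
console.log(rowsFor(["c"], indexRows).length); // 0
```
>>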
>> One snag though: if we think ahead and try to see how we could implement by-access-changes
we get stuck: a view does not include rows for deleted documents while _changes does. In addition,
the update sequence for a document is not available in a map function. So a regular view can
not be used here.
>> 
>> The filtering of deleted docs from a view index happens in couch_mrview:map_fold/3.
So if we could augment that for our internal view requests, that could get us a long way towards
reusing the rest of the couch_mrview/couch_index machinery.
>> 
>> Note to self: make sure view compaction doesn’t remove deleted docs. A cursory
glance at couch_mrview_compactor:compact_view_btree/5 suggests no such thing, but we need
to validate this, and if it doesn’t hold, change view compaction to keep deleted entries.
>> 
>> * * *
>> 
>> We’ll start giving this a try by forking things off in couch_mrview:query_all_docs/4
and pretending to call a view with a mocked ddoc:
>> 
>> {
>> "_id": "_design/_access",
>> "language": "_access",
>> "views": {} // if needed
>> } // TODO see which other fields it needs
>> 
>> We’ll try this road to see if we get to the point where we get a “view index
not found” error, because we didn’t actually have a view index yet. We’ll then try and
see if we can produce one. We could try the other way around too, building the index first
and then trying to query, but the approach doesn’t make much of a difference.
>> 
>> First demo working: https://gist.github.com/janl/20b218a3f0eafbf963ee28780261f9fc
>> 
>> 
>> Next Steps:
>> - make sure the startkey/endkey/descending argument handling is all correct and complete
>> - add key un-munging, so the user/role prefix gets filtered out on reads
>> - handle roles:
>>   - instead of querying the _access view once, we need to issue a multi-query, probably
via #mrargs.multi_get, read up on how that is used
>> - then we could start thinking about by-access-seq:
>>   - we need access to the update-seq in couch_access_native_proc:map_doc, might require
view protocol upgrade, or we have a post-process function that tags on the update-seq, we’ll
see.
>>   - the admin/access split we’re doing in query_all_docs should probably happen
in couch_db:changes_since/5
>> 
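>> The roles multi-query from the list above could take roughly the following shape. This is a sketch under assumptions: one key range per username or role, results merged with each doc returned once. `accessRanges`, `multiQuery` and the `queryRange` callback are hypothetical, AND groups are not handled, and how #mrargs.multi_get would actually be used is still open.
>>
```javascript
// One key range per username/role: every key that starts with [name].
// {} as the high end follows CouchDB view collation, where objects sort
// after strings; the range shape itself is an assumption of this sketch.
function accessRanges(userCtx) {
  return [userCtx.name, ...userCtx.roles].map((name) => ({
    startkey: [name],
    endkey: [name, {}],
  }));
}

// Merge per-range results, returning each doc once even if several of the
// user's names grant access to it. queryRange is a hypothetical callback.
function multiQuery(userCtx, queryRange) {
  const seen = new Map();
  for (const range of accessRanges(userCtx)) {
    for (const row of queryRange(range)) {
      if (!seen.has(row.id)) seen.set(row.id, row);
    }
  }
  return [...seen.values()].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
}

// In-memory stand-in for the by-access-id index.
const fakeIndex = [
  { id: "doc1", key: ["b", "doc1"] },
  { id: "doc1", key: ["dev", "doc1"] },
  { id: "doc2", key: ["dev", "doc2"] },
  { id: "doc3", key: ["ops", "doc3"] },
];
const byPrefix = (range) => fakeIndex.filter((row) => row.key[0] === range.startkey[0]);
console.log(multiQuery({ name: "b", roles: ["dev"] }, byPrefix).map((row) => row.id));
// [ 'doc1', 'doc2' ]
```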
>> 
>> 
>> 
>> 
>> 
>> # More specification details
>> 
>> 
>> Documents in databases with _access enabled are private/admin-only by default,
and can be made public with the special role _public.
>> 
>> TODO: shared id space or auto-prefix ids
>> 
>> 
> 

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/

