couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jan Lehnardt <...@php.net>
Subject Re: Nested views: iterative map reduce implementation
Date Sun, 27 Jan 2013 07:59:38 GMT
Hi Nicholas,

This looks very interesting. My apologies for not looking into this in more detail (it's CouchDB
Conf week). I'll definitely have a closer look soon.

Cheers
Jan
--

On 22.01.2013, at 16:18, "nicholas a. evans" <nick@ekenosen.net> wrote:

> I've started hacking on an implementation of iterative map reduce that I'm
> calling "nested views".  It's not nearly finished, but well overdue for
> feedback from people who know what they're doing with CouchDB and Erlang. :)
> 
> ****************************
> My bosses are open to a short term contract with an Erlang hacking CouchDB
> committer to get this into a safely usable state ASAP.
> ****************************
> 
> == Intended Usage
> 
> To create a nested view, add "views" inside of a view like so:
> 
>    {
>      "_id": "_design/ddoc",
>      "language": "javascript",
>      "views": {
>        "foo_count": {
>          "map": "function (doc) { emit(doc.foo, 1); }",
>          "reduce": "_sum",
>          "views": {
>            "sorted_by_count": {
>              "map":"function(doc) { emit(doc.value, nil); }"
>            }
>          }
>        }
>      }
>    }
> 
> The parent row is reduced with group_level=exact, and each reduced row is
> sent to the nested view map functions as a document that looks like so:
> 
>    {
>      "_id": json_encode(parent_view_key),
>      "key": parent_view_key, "value": reduced_value,
>    }
> 
> Nested views would be queried using the standard view API like so:
> 
>    GET /db/_design/ddoc/_view/foo_count/_view/sorted_by_count?query=args
> 
> Using include_docs=true would return the reduced parent row as the doc.
> 
> == Comparison to Cloudant's "dbcopy"
> 
> The trade-offs between this approach and chaining to a different design doc
> or a different database are similar to the trade-offs of storing all of your
> views in one design doc versus a view per design doc:
> 
> * simplicity of one design doc vs multiple
> * consistency within design doc index vs eventual consistency
> * high vs low latency between doc insertion and view availability
> * entire index is invalidated when one function changes
> * some views may need to be compacted more frequently than others
> 
> I have some application code that could take advantage of each approach, and
> ultimately I'd like to see both styles supported natively by Apache CouchDB.
> 
> == Implementation notes
> 
> New record fields: #mrview{path=[], nested_id_btree=nil}.
> 
> Each nested view creates a new View#mrview object which is stored directly
> on the #mrstate.views list.  The view state saved in the #mrheader will
> include a second btree in parent views.  View#mrview.nested_id_btree keeps
> track of ToRemove entries for incremental updates (analogous to
> #mrstate.id_btree).  View#mrview.path represents the "path" of parent views from
> the root as a list.  It will be used used (by couch_mrview_util:get_view/4) for
> view lookup, influences the sort order of the views in #mrst.views, simplifies
> include_docs, detects "safe" top-level views during the initial mapper index
> updater pass, and ensures uniqueness when comparing one view to another (e.g.
> views could share their map definitions, but have different data sources).  If
> two different paths share identical function code, they will still result in two
> View indexes.
> 
> In origin/master, couch_mrview_updater:update/2 starts up two linked
> processes from the parent index updater process; one for mapping, one for
> writing (reductions stored in the btree are handled by the couch_btree
> module).  I've created a third process to handle the nested mapping.  The writer
> process traverses the view tree depth-first, and should send the reduced rows
> (group_level=exact) to the nested mapper process.  The writer process waits for
> the nested results before continuing, because otherwise the mrstate would claim
> a higher update_seq than the nested views have completed indexing and the index
> file would be temporarily inconsistent.
> 
> My (incomplete) code is at http://github.com/nevans/couchdb (in master).  I did
> a lot of refactoring just so I could understand the CouchDB code better and to
> test my knowledge of Erlang.  Most of these were splitting up large functions,
> renaming variables, and changing recursive functions into higher order functions
> passed to lists:foldl (I have the same aversion to explicit looping in ruby when
> Enumerable#inject can do the job).  I noticed that master doesn't contain a
> single merge commit(!) so obviously this would need to be cleaned up a lot
> (cosmetic refactorings could be backed out or placed in separate commits) if
> there is ever any interest in merging it.
> 
> == Implementation progress
> 
> First-pass "done" so far:
> 
> * Tests!
>   * First do no harm: still passing original tests.
>   * Stupid basic test that creates a single simple nested view
>     and tries to update it
> * Creation
>   * Parse the design docs to create the nested views.
> * Updates
>   * Traversal of nested views by writer process.
>   * Visiting each nested view by nested mapper process.
>     * although its buggy and crashes when I send in sample data.
> * Retrieval
>   * nothing yet...
> 
> Still incomplete in my branch (with some questions):
> 
> * Tests!
>   * more complicated nested views
>   * queried nested views
>   * queried with include_docs
> 
> * Creation
>   * Add the nested views to the mrheader; read them in and write them out.
>     couch_mrview_util:init_state/4
>     couch_mrview_util:open_view/5
>   * Initialize the nested_id_btree (with appropriate Reduce, Less, etc).
>     couch_mrview_util:open_view/5
>   * Consider nested view defs in computing the index file's MD5.
>     couch_mrview_util:ddoc_to_mrst/2
>     Is this automatically accomplished by updating #mrst.views and #mrview?
> 
> * Updates
>   * Update sample data without crashing!
>   * Pull in the group/reduced rows and format them for the nested mapper.
>     * This keeps stumping me.
> 
> * Retrieval
>   * Query nested views at all.
>   * Query nested views with include_docs.
> 
> * Etc (does this matter?)
>   * couch_mrview_util:calculate_data_size/2
>   * ???
> 
> == Questions
> 
> As an Erlang newbie hacking on unfamiliar source code, I hope I haven't made
> extra work for myself by going with this approach vs a view changes feed
> multi-db Cloudant-style approach...  Is there anything incredibly stupid about
> this approach?  :-)
> 
> Any thoughts on the intended API?
> 
> I'm having a hard time querying the view in its current state from the internals
> of couch.  I've got the partially updated View state object with its updated
> Btrees, which is *not* the same View and Btree that would come back if I were to
> use couch_mrview_util:get_view(Db, DDoc, Vname, Args).  But I haven't been able
> to successfully hack together a version of couch_mrview:query_view that will do
> what I want.
> 
> Can incremental updates for the most common use case (alternate sort
> indexes) be handled by something simpler than a completely generic iterative
> map reduce?
> 
> I've branched from origin/master.  If I want to put this into production for
> myself as soon as its "done", should I use the 1.3 branch instead?
> 
> I've probably created some performance regressions in the nested view
> traversal.
> Any thoughts there?  How should I test that (other than by timing my own
> datasets)?  Right now, getting this done is more important for us than getting
> it done optimally.
> 
> Is there a "view index version" number that should be incremented somewhere,
> so the new mrheader format doesn't break old views?  Is there a good way to
> update a view file in place (e.g. see an old header and rewrite a new
> header)?
> 
> Since the nested map and nested write queues never have more than one entry
> at a time, should I even bother with the overhead of couch_work_queue?
> Should I just use the plain old Erlang "send message receive response"
> pattern instead?  Since it is done synchronously, should it even use a
> separate process?  To avoid CPU bottlenecks, it like "merge_results" should
> probably also be in its own process?
> 
> On a side not, it seems wasteful to constantly be passing source code to the
> view servers.  Have you considered using a pattern like redis's evalsha
> scripting?  First check (using the checksums for the functions) to see if the
> functions have been loaded on that view server.  If not then load them.  All
> other references to functions use the checksum instead.  Perhaps Spidermonkey is
> smart enough to keep an internal function cache, but it still seems like
> non-trivial IO bandwidth could be wasted on large function bodies.
> 
> 
> Since you've read this far, a reminder: My bosses are open to a short term
> contract with an Erlang hacking CouchDB committer to get this into a safely
> usable state ASAP.
> 
> Thanks for any help!  I'll also be available in #couchdb and #couchdb-dev.
> 
> Nick Evans

Mime
View raw message