couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Nested views: iterative map reduce implementation
Date Fri, 15 Mar 2013 19:58:59 GMT
Definitely been meaning to provide a review for this. I'll try and get
to it in a few hours.

On Fri, Mar 15, 2013 at 2:36 PM, Noah Slater <nslater@apache.org> wrote:
> Hey Nick,
>
> This looks pretty great. Is it something you need assistance with? Anything
> I can do to help?
>
> Thanks,
>
>
> On 22 January 2013 15:18, nicholas a. evans <nick@ekenosen.net> wrote:
>
>> I've started hacking on an implementation of iterative map reduce that I'm
>> calling "nested views".  It's not nearly finished, but well overdue for
>> feedback from people who know what they're doing with CouchDB and Erlang.
>> :)
>>
>> ****************************
>> My bosses are open to a short term contract with an Erlang hacking CouchDB
>> committer to get this into a safely usable state ASAP.
>> ****************************
>>
>> == Intended Usage
>>
>> To create a nested view, add "views" inside of a view like so:
>>
>>     {
>>       "_id": "_design/ddoc",
>>       "language": "javascript",
>>       "views": {
>>         "foo_count": {
>>           "map": "function (doc) { emit(doc.foo, 1); }",
>>           "reduce": "_sum",
>>           "views": {
>>             "sorted_by_count": {
>>               "map":"function(doc) { emit(doc.value, nil); }"
>>             }
>>           }
>>         }
>>       }
>>     }
>>
>> The parent row is reduced with group_level=exact, and each reduced row is
>> sent to the nested view map functions as a document that looks like so:
>>
>>     {
>>       "_id": json_encode(parent_view_key),
>>       "key": parent_view_key, "value": reduced_value,
>>     }
>>
>> Nested views would be queried using the standard view API like so:
>>
>>     GET /db/_design/ddoc/_view/foo_count/_view/sorted_by_count?query=args
>>
>> Using include_docs=true would return the reduced parent row as the doc.
>>
>> == Comparison to Cloudant's "dbcopy"
>>
>> The trade-offs between this approach and chaining to a different design doc
>> or a different database are similar to the trade-offs of storing all of
>> your
>> views in one design doc versus a view per design doc:
>>
>>  * simplicity of one design doc vs multiple
>>  * consistency within design doc index vs eventual consistency
>>  * high vs low latency between doc insertion and view availability
>>  * entire index is invalidated when one function changes
>>  * some views may need to be compacted more frequently than others
>>
>> I have some application code that could take advantage of each approach,
>> and
>> ultimately I'd like to see both styles supported natively by Apache
>> CouchDB.
>>
>> == Implementation notes
>>
>> New record fields: #mrview{path=[], nested_id_btree=nil}.
>>
>> Each nested view creates a new View#mrview object which is stored directly
>> on the #mrstate.views list.  The view state saved in the #mrheader will
>> include a second btree in parent views.  View#mrview.nested_id_btree keeps
>> track of ToRemove entries for incremental updates (analogous to
>> #mrstate.id_btree).  View#mrview.path represents the "path" of parent
>> views from
>> the root as a list.  It will be used used (by
>> couch_mrview_util:get_view/4) for
>> view lookup, influences the sort order of the views in #mrst.views,
>> simplifies
>> include_docs, detects "safe" top-level views during the initial mapper
>> index
>> updater pass, and ensures uniqueness when comparing one view to another
>> (e.g.
>> views could share their map definitions, but have different data sources).
>>  If
>> two different paths share identical function code, they will still result
>> in two
>> View indexes.
>>
>> In origin/master, couch_mrview_updater:update/2 starts up two linked
>> processes from the parent index updater process; one for mapping, one for
>> writing (reductions stored in the btree are handled by the couch_btree
>> module).  I've created a third process to handle the nested mapping.  The
>> writer
>> process traverses the view tree depth-first, and should send the reduced
>> rows
>> (group_level=exact) to the nested mapper process.  The writer process
>> waits for
>> the nested results before continuing, because otherwise the mrstate would
>> claim
>> a higher update_seq than the nested views have completed indexing and the
>> index
>> file would be temporarily inconsistent.
>>
>> My (incomplete) code is at http://github.com/nevans/couchdb (in master).
>>  I did
>> a lot of refactoring just so I could understand the CouchDB code better
>> and to
>> test my knowledge of Erlang.  Most of these were splitting up large
>> functions,
>> renaming variables, and changing recursive functions into higher order
>> functions
>> passed to lists:foldl (I have the same aversion to explicit looping in
>> ruby when
>> Enumerable#inject can do the job).  I noticed that master doesn't contain a
>> single merge commit(!) so obviously this would need to be cleaned up a lot
>> (cosmetic refactorings could be backed out or placed in separate commits)
>> if
>> there is ever any interest in merging it.
>>
>> == Implementation progress
>>
>> First-pass "done" so far:
>>
>>  * Tests!
>>    * First do no harm: still passing original tests.
>>    * Stupid basic test that creates a single simple nested view
>>      and tries to update it
>>  * Creation
>>    * Parse the design docs to create the nested views.
>>  * Updates
>>    * Traversal of nested views by writer process.
>>    * Visiting each nested view by nested mapper process.
>>      * although its buggy and crashes when I send in sample data.
>>  * Retrieval
>>    * nothing yet...
>>
>> Still incomplete in my branch (with some questions):
>>
>>  * Tests!
>>    * more complicated nested views
>>    * queried nested views
>>    * queried with include_docs
>>
>>  * Creation
>>    * Add the nested views to the mrheader; read them in and write them out.
>>      couch_mrview_util:init_state/4
>>      couch_mrview_util:open_view/5
>>    * Initialize the nested_id_btree (with appropriate Reduce, Less, etc).
>>      couch_mrview_util:open_view/5
>>    * Consider nested view defs in computing the index file's MD5.
>>      couch_mrview_util:ddoc_to_mrst/2
>>      Is this automatically accomplished by updating #mrst.views and
>> #mrview?
>>
>>  * Updates
>>    * Update sample data without crashing!
>>    * Pull in the group/reduced rows and format them for the nested mapper.
>>      * This keeps stumping me.
>>
>>  * Retrieval
>>    * Query nested views at all.
>>    * Query nested views with include_docs.
>>
>>  * Etc (does this matter?)
>>    * couch_mrview_util:calculate_data_size/2
>>    * ???
>>
>> == Questions
>>
>> As an Erlang newbie hacking on unfamiliar source code, I hope I haven't
>> made
>> extra work for myself by going with this approach vs a view changes feed
>> multi-db Cloudant-style approach...  Is there anything incredibly stupid
>> about
>> this approach?  :-)
>>
>> Any thoughts on the intended API?
>>
>> I'm having a hard time querying the view in its current state from the
>> internals
>> of couch.  I've got the partially updated View state object with its
>> updated
>> Btrees, which is *not* the same View and Btree that would come back if I
>> were to
>> use couch_mrview_util:get_view(Db, DDoc, Vname, Args).  But I haven't been
>> able
>> to successfully hack together a version of couch_mrview:query_view that
>> will do
>> what I want.
>>
>> Can incremental updates for the most common use case (alternate sort
>> indexes) be handled by something simpler than a completely generic
>> iterative
>> map reduce?
>>
>> I've branched from origin/master.  If I want to put this into production
>> for
>> myself as soon as its "done", should I use the 1.3 branch instead?
>>
>> I've probably created some performance regressions in the nested view
>> traversal.
>> Any thoughts there?  How should I test that (other than by timing my own
>> datasets)?  Right now, getting this done is more important for us than
>> getting
>> it done optimally.
>>
>> Is there a "view index version" number that should be incremented
>> somewhere,
>> so the new mrheader format doesn't break old views?  Is there a good way to
>> update a view file in place (e.g. see an old header and rewrite a new
>> header)?
>>
>> Since the nested map and nested write queues never have more than one entry
>> at a time, should I even bother with the overhead of couch_work_queue?
>> Should I just use the plain old Erlang "send message receive response"
>> pattern instead?  Since it is done synchronously, should it even use a
>> separate process?  To avoid CPU bottlenecks, it like "merge_results" should
>> probably also be in its own process?
>>
>> On a side not, it seems wasteful to constantly be passing source code to
>> the
>> view servers.  Have you considered using a pattern like redis's evalsha
>> scripting?  First check (using the checksums for the functions) to see if
>> the
>> functions have been loaded on that view server.  If not then load them.
>>  All
>> other references to functions use the checksum instead.  Perhaps
>> Spidermonkey is
>> smart enough to keep an internal function cache, but it still seems like
>> non-trivial IO bandwidth could be wasted on large function bodies.
>>
>>
>> Since you've read this far, a reminder: My bosses are open to a short term
>> contract with an Erlang hacking CouchDB committer to get this into a safely
>> usable state ASAP.
>>
>> Thanks for any help!  I'll also be available in #couchdb and #couchdb-dev.
>>
>> Nick Evans
>>
>
>
>
> --
> NS

Mime
View raw message