couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Noah Slater <nsla...@apache.org>
Subject Re: Nested views: iterative map reduce implementation
Date Fri, 15 Mar 2013 19:36:37 GMT
Hey Nick,

This looks pretty great. Is it something you need assistance with? Anything
I can do to help?

Thanks,


On 22 January 2013 15:18, nicholas a. evans <nick@ekenosen.net> wrote:

> I've started hacking on an implementation of iterative map reduce that I'm
> calling "nested views".  It's not nearly finished, but well overdue for
> feedback from people who know what they're doing with CouchDB and Erlang.
> :)
>
> ****************************
> My bosses are open to a short term contract with an Erlang hacking CouchDB
> committer to get this into a safely usable state ASAP.
> ****************************
>
> == Intended Usage
>
> To create a nested view, add "views" inside of a view like so:
>
>     {
>       "_id": "_design/ddoc",
>       "language": "javascript",
>       "views": {
>         "foo_count": {
>           "map": "function (doc) { emit(doc.foo, 1); }",
>           "reduce": "_sum",
>           "views": {
>             "sorted_by_count": {
>               "map":"function(doc) { emit(doc.value, nil); }"
>             }
>           }
>         }
>       }
>     }
>
> The parent row is reduced with group_level=exact, and each reduced row is
> sent to the nested view map functions as a document that looks like so:
>
>     {
>       "_id": json_encode(parent_view_key),
>       "key": parent_view_key, "value": reduced_value,
>     }
>
> Nested views would be queried using the standard view API like so:
>
>     GET /db/_design/ddoc/_view/foo_count/_view/sorted_by_count?query=args
>
> Using include_docs=true would return the reduced parent row as the doc.
>
> == Comparison to Cloudant's "dbcopy"
>
> The trade-offs between this approach and chaining to a different design doc
> or a different database are similar to the trade-offs of storing all of
> your
> views in one design doc versus a view per design doc:
>
>  * simplicity of one design doc vs multiple
>  * consistency within design doc index vs eventual consistency
>  * high vs low latency between doc insertion and view availability
>  * entire index is invalidated when one function changes
>  * some views may need to be compacted more frequently than others
>
> I have some application code that could take advantage of each approach,
> and
> ultimately I'd like to see both styles supported natively by Apache
> CouchDB.
>
> == Implementation notes
>
> New record fields: #mrview{path=[], nested_id_btree=nil}.
>
> Each nested view creates a new View#mrview object which is stored directly
> on the #mrstate.views list.  The view state saved in the #mrheader will
> include a second btree in parent views.  View#mrview.nested_id_btree keeps
> track of ToRemove entries for incremental updates (analogous to
> #mrstate.id_btree).  View#mrview.path represents the "path" of parent
> views from
> the root as a list.  It will be used used (by
> couch_mrview_util:get_view/4) for
> view lookup, influences the sort order of the views in #mrst.views,
> simplifies
> include_docs, detects "safe" top-level views during the initial mapper
> index
> updater pass, and ensures uniqueness when comparing one view to another
> (e.g.
> views could share their map definitions, but have different data sources).
>  If
> two different paths share identical function code, they will still result
> in two
> View indexes.
>
> In origin/master, couch_mrview_updater:update/2 starts up two linked
> processes from the parent index updater process; one for mapping, one for
> writing (reductions stored in the btree are handled by the couch_btree
> module).  I've created a third process to handle the nested mapping.  The
> writer
> process traverses the view tree depth-first, and should send the reduced
> rows
> (group_level=exact) to the nested mapper process.  The writer process
> waits for
> the nested results before continuing, because otherwise the mrstate would
> claim
> a higher update_seq than the nested views have completed indexing and the
> index
> file would be temporarily inconsistent.
>
> My (incomplete) code is at http://github.com/nevans/couchdb (in master).
>  I did
> a lot of refactoring just so I could understand the CouchDB code better
> and to
> test my knowledge of Erlang.  Most of these were splitting up large
> functions,
> renaming variables, and changing recursive functions into higher order
> functions
> passed to lists:foldl (I have the same aversion to explicit looping in
> ruby when
> Enumerable#inject can do the job).  I noticed that master doesn't contain a
> single merge commit(!) so obviously this would need to be cleaned up a lot
> (cosmetic refactorings could be backed out or placed in separate commits)
> if
> there is ever any interest in merging it.
>
> == Implementation progress
>
> First-pass "done" so far:
>
>  * Tests!
>    * First do no harm: still passing original tests.
>    * Stupid basic test that creates a single simple nested view
>      and tries to update it
>  * Creation
>    * Parse the design docs to create the nested views.
>  * Updates
>    * Traversal of nested views by writer process.
>    * Visiting each nested view by nested mapper process.
>      * although its buggy and crashes when I send in sample data.
>  * Retrieval
>    * nothing yet...
>
> Still incomplete in my branch (with some questions):
>
>  * Tests!
>    * more complicated nested views
>    * queried nested views
>    * queried with include_docs
>
>  * Creation
>    * Add the nested views to the mrheader; read them in and write them out.
>      couch_mrview_util:init_state/4
>      couch_mrview_util:open_view/5
>    * Initialize the nested_id_btree (with appropriate Reduce, Less, etc).
>      couch_mrview_util:open_view/5
>    * Consider nested view defs in computing the index file's MD5.
>      couch_mrview_util:ddoc_to_mrst/2
>      Is this automatically accomplished by updating #mrst.views and
> #mrview?
>
>  * Updates
>    * Update sample data without crashing!
>    * Pull in the group/reduced rows and format them for the nested mapper.
>      * This keeps stumping me.
>
>  * Retrieval
>    * Query nested views at all.
>    * Query nested views with include_docs.
>
>  * Etc (does this matter?)
>    * couch_mrview_util:calculate_data_size/2
>    * ???
>
> == Questions
>
> As an Erlang newbie hacking on unfamiliar source code, I hope I haven't
> made
> extra work for myself by going with this approach vs a view changes feed
> multi-db Cloudant-style approach...  Is there anything incredibly stupid
> about
> this approach?  :-)
>
> Any thoughts on the intended API?
>
> I'm having a hard time querying the view in its current state from the
> internals
> of couch.  I've got the partially updated View state object with its
> updated
> Btrees, which is *not* the same View and Btree that would come back if I
> were to
> use couch_mrview_util:get_view(Db, DDoc, Vname, Args).  But I haven't been
> able
> to successfully hack together a version of couch_mrview:query_view that
> will do
> what I want.
>
> Can incremental updates for the most common use case (alternate sort
> indexes) be handled by something simpler than a completely generic
> iterative
> map reduce?
>
> I've branched from origin/master.  If I want to put this into production
> for
> myself as soon as its "done", should I use the 1.3 branch instead?
>
> I've probably created some performance regressions in the nested view
> traversal.
> Any thoughts there?  How should I test that (other than by timing my own
> datasets)?  Right now, getting this done is more important for us than
> getting
> it done optimally.
>
> Is there a "view index version" number that should be incremented
> somewhere,
> so the new mrheader format doesn't break old views?  Is there a good way to
> update a view file in place (e.g. see an old header and rewrite a new
> header)?
>
> Since the nested map and nested write queues never have more than one entry
> at a time, should I even bother with the overhead of couch_work_queue?
> Should I just use the plain old Erlang "send message receive response"
> pattern instead?  Since it is done synchronously, should it even use a
> separate process?  To avoid CPU bottlenecks, it like "merge_results" should
> probably also be in its own process?
>
> On a side not, it seems wasteful to constantly be passing source code to
> the
> view servers.  Have you considered using a pattern like redis's evalsha
> scripting?  First check (using the checksums for the functions) to see if
> the
> functions have been loaded on that view server.  If not then load them.
>  All
> other references to functions use the checksum instead.  Perhaps
> Spidermonkey is
> smart enough to keep an internal function cache, but it still seems like
> non-trivial IO bandwidth could be wasted on large function bodies.
>
>
> Since you've read this far, a reminder: My bosses are open to a short term
> contract with an Erlang hacking CouchDB committer to get this into a safely
> usable state ASAP.
>
> Thanks for any help!  I'll also be available in #couchdb and #couchdb-dev.
>
> Nick Evans
>



-- 
NS

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message