couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "nicholas a. evans" <n...@ekenosen.net>
Subject Re: Nested views: iterative map reduce implementation
Date Fri, 15 Mar 2013 20:47:52 GMT
Yes, there is *totally* stuff I need assistance with. :)  This was my
first attempt at learning Erlang, and I worked myself into a couple of
dead ends and blindly coded myself into a few corners.  It's been a
few weeks since I looked at this, so it would take a little while for
me to get back up to speed on it myself.  I had at least two really
big questions or bugs that were leaving me stumped... but I don't
remember what they were.  ;)  I'd need to scan through to remember
what was working and what was broken and what was simply
un-implemented when I left off.

I *finally* convinced my bosses to give Cloudant a chance (which we
have happily switched over to during the last few weeks), and we're
going to be using their dbcopy implementation of incremental iterative
map reduce.  That removed both the urgency and (more importantly) the
ability to work on it during business hours.  But I *am* interested in
getting this (or something very like it) into core CouchDB.

--
Nick

On Fri, Mar 15, 2013 at 3:58 PM, Paul Davis <paul.joseph.davis@gmail.com> wrote:
> Definitely been meaning to provide a review for this. I'll try and get
> to it in a few hours.
>
> On Fri, Mar 15, 2013 at 2:36 PM, Noah Slater <nslater@apache.org> wrote:
>> Hey Nick,
>>
>> This looks pretty great. Is it something you need assistance with? Anything
>> I can do to help?
>>
>> Thanks,
>>
>>
>> On 22 January 2013 15:18, nicholas a. evans <nick@ekenosen.net> wrote:
>>
>>> I've started hacking on an implementation of iterative map reduce that I'm
>>> calling "nested views".  It's not nearly finished, but well overdue for
>>> feedback from people who know what they're doing with CouchDB and Erlang.
>>> :)
>>>
>>> ****************************
>>> My bosses are open to a short term contract with an Erlang hacking CouchDB
>>> committer to get this into a safely usable state ASAP.
>>> ****************************
>>>
>>> == Intended Usage
>>>
>>> To create a nested view, add "views" inside of a view like so:
>>>
>>>     {
>>>       "_id": "_design/ddoc",
>>>       "language": "javascript",
>>>       "views": {
>>>         "foo_count": {
>>>           "map": "function (doc) { emit(doc.foo, 1); }",
>>>           "reduce": "_sum",
>>>           "views": {
>>>             "sorted_by_count": {
>>>               "map":"function(doc) { emit(doc.value, nil); }"
>>>             }
>>>           }
>>>         }
>>>       }
>>>     }
>>>
>>> The parent row is reduced with group_level=exact, and each reduced row is
>>> sent to the nested view map functions as a document that looks like so:
>>>
>>>     {
>>>       "_id": json_encode(parent_view_key),
>>>       "key": parent_view_key, "value": reduced_value,
>>>     }
>>>
>>> Nested views would be queried using the standard view API like so:
>>>
>>>     GET /db/_design/ddoc/_view/foo_count/_view/sorted_by_count?query=args
>>>
>>> Using include_docs=true would return the reduced parent row as the doc.
>>>
>>> == Comparison to Cloudant's "dbcopy"
>>>
>>> The trade-offs between this approach and chaining to a different design doc
>>> or a different database are similar to the trade-offs of storing all of
>>> your
>>> views in one design doc versus a view per design doc:
>>>
>>>  * simplicity of one design doc vs multiple
>>>  * consistency within design doc index vs eventual consistency
>>>  * high vs low latency between doc insertion and view availability
>>>  * entire index is invalidated when one function changes
>>>  * some views may need to be compacted more frequently than others
>>>
>>> I have some application code that could take advantage of each approach,
>>> and
>>> ultimately I'd like to see both styles supported natively by Apache
>>> CouchDB.
>>>
>>> == Implementation notes
>>>
>>> New record fields: #mrview{path=[], nested_id_btree=nil}.
>>>
>>> Each nested view creates a new View#mrview object which is stored directly
>>> on the #mrstate.views list.  The view state saved in the #mrheader will
>>> include a second btree in parent views.  View#mrview.nested_id_btree keeps
>>> track of ToRemove entries for incremental updates (analogous to
>>> #mrstate.id_btree).  View#mrview.path represents the "path" of parent
>>> views from
>>> the root as a list.  It will be used used (by
>>> couch_mrview_util:get_view/4) for
>>> view lookup, influences the sort order of the views in #mrst.views,
>>> simplifies
>>> include_docs, detects "safe" top-level views during the initial mapper
>>> index
>>> updater pass, and ensures uniqueness when comparing one view to another
>>> (e.g.
>>> views could share their map definitions, but have different data sources).
>>>  If
>>> two different paths share identical function code, they will still result
>>> in two
>>> View indexes.
>>>
>>> In origin/master, couch_mrview_updater:update/2 starts up two linked
>>> processes from the parent index updater process; one for mapping, one for
>>> writing (reductions stored in the btree are handled by the couch_btree
>>> module).  I've created a third process to handle the nested mapping.  The
>>> writer
>>> process traverses the view tree depth-first, and should send the reduced
>>> rows
>>> (group_level=exact) to the nested mapper process.  The writer process
>>> waits for
>>> the nested results before continuing, because otherwise the mrstate would
>>> claim
>>> a higher update_seq than the nested views have completed indexing and the
>>> index
>>> file would be temporarily inconsistent.
>>>
>>> My (incomplete) code is at http://github.com/nevans/couchdb (in master).
>>>  I did
>>> a lot of refactoring just so I could understand the CouchDB code better
>>> and to
>>> test my knowledge of Erlang.  Most of these were splitting up large
>>> functions,
>>> renaming variables, and changing recursive functions into higher order
>>> functions
>>> passed to lists:foldl (I have the same aversion to explicit looping in
>>> ruby when
>>> Enumerable#inject can do the job).  I noticed that master doesn't contain a
>>> single merge commit(!) so obviously this would need to be cleaned up a lot
>>> (cosmetic refactorings could be backed out or placed in separate commits)
>>> if
>>> there is ever any interest in merging it.
>>>
>>> == Implementation progress
>>>
>>> First-pass "done" so far:
>>>
>>>  * Tests!
>>>    * First do no harm: still passing original tests.
>>>    * Stupid basic test that creates a single simple nested view
>>>      and tries to update it
>>>  * Creation
>>>    * Parse the design docs to create the nested views.
>>>  * Updates
>>>    * Traversal of nested views by writer process.
>>>    * Visiting each nested view by nested mapper process.
>>>      * although its buggy and crashes when I send in sample data.
>>>  * Retrieval
>>>    * nothing yet...
>>>
>>> Still incomplete in my branch (with some questions):
>>>
>>>  * Tests!
>>>    * more complicated nested views
>>>    * queried nested views
>>>    * queried with include_docs
>>>
>>>  * Creation
>>>    * Add the nested views to the mrheader; read them in and write them out.
>>>      couch_mrview_util:init_state/4
>>>      couch_mrview_util:open_view/5
>>>    * Initialize the nested_id_btree (with appropriate Reduce, Less, etc).
>>>      couch_mrview_util:open_view/5
>>>    * Consider nested view defs in computing the index file's MD5.
>>>      couch_mrview_util:ddoc_to_mrst/2
>>>      Is this automatically accomplished by updating #mrst.views and
>>> #mrview?
>>>
>>>  * Updates
>>>    * Update sample data without crashing!
>>>    * Pull in the group/reduced rows and format them for the nested mapper.
>>>      * This keeps stumping me.
>>>
>>>  * Retrieval
>>>    * Query nested views at all.
>>>    * Query nested views with include_docs.
>>>
>>>  * Etc (does this matter?)
>>>    * couch_mrview_util:calculate_data_size/2
>>>    * ???
>>>
>>> == Questions
>>>
>>> As an Erlang newbie hacking on unfamiliar source code, I hope I haven't
>>> made
>>> extra work for myself by going with this approach vs a view changes feed
>>> multi-db Cloudant-style approach...  Is there anything incredibly stupid
>>> about
>>> this approach?  :-)
>>>
>>> Any thoughts on the intended API?
>>>
>>> I'm having a hard time querying the view in its current state from the
>>> internals
>>> of couch.  I've got the partially updated View state object with its
>>> updated
>>> Btrees, which is *not* the same View and Btree that would come back if I
>>> were to
>>> use couch_mrview_util:get_view(Db, DDoc, Vname, Args).  But I haven't been
>>> able
>>> to successfully hack together a version of couch_mrview:query_view that
>>> will do
>>> what I want.
>>>
>>> Can incremental updates for the most common use case (alternate sort
>>> indexes) be handled by something simpler than a completely generic
>>> iterative
>>> map reduce?
>>>
>>> I've branched from origin/master.  If I want to put this into production
>>> for
>>> myself as soon as its "done", should I use the 1.3 branch instead?
>>>
>>> I've probably created some performance regressions in the nested view
>>> traversal.
>>> Any thoughts there?  How should I test that (other than by timing my own
>>> datasets)?  Right now, getting this done is more important for us than
>>> getting
>>> it done optimally.
>>>
>>> Is there a "view index version" number that should be incremented
>>> somewhere,
>>> so the new mrheader format doesn't break old views?  Is there a good way to
>>> update a view file in place (e.g. see an old header and rewrite a new
>>> header)?
>>>
>>> Since the nested map and nested write queues never have more than one entry
>>> at a time, should I even bother with the overhead of couch_work_queue?
>>> Should I just use the plain old Erlang "send message receive response"
>>> pattern instead?  Since it is done synchronously, should it even use a
>>> separate process?  To avoid CPU bottlenecks, it like "merge_results" should
>>> probably also be in its own process?
>>>
>>> On a side not, it seems wasteful to constantly be passing source code to
>>> the
>>> view servers.  Have you considered using a pattern like redis's evalsha
>>> scripting?  First check (using the checksums for the functions) to see if
>>> the
>>> functions have been loaded on that view server.  If not then load them.
>>>  All
>>> other references to functions use the checksum instead.  Perhaps
>>> Spidermonkey is
>>> smart enough to keep an internal function cache, but it still seems like
>>> non-trivial IO bandwidth could be wasted on large function bodies.
>>>
>>>
>>> Since you've read this far, a reminder: My bosses are open to a short term
>>> contract with an Erlang hacking CouchDB committer to get this into a safely
>>> usable state ASAP.
>>>
>>> Thanks for any help!  I'll also be available in #couchdb and #couchdb-dev.
>>>
>>> Nick Evans
>>>
>>
>>
>>
>> --
>> NS

Mime
View raw message