From: "nicholas a. evans" <nicholas.evans@gmail.com>
Date: Tue, 22 Jan 2013 10:18:24 -0500
To: dev@couchdb.apache.org
Subject: Nested views: iterative map reduce implementation

I've started hacking on an implementation of iterative map reduce that I'm calling "nested views". It's not nearly finished, but it's well overdue for feedback from people who know what they're doing with CouchDB and Erlang. :)

****************************
My bosses are open to a short-term contract with an Erlang-hacking CouchDB committer to get this into a safely usable state ASAP.
****************************
== Intended Usage

To create a nested view, add "views" inside of a view, like so:

    {
      "_id": "_design/ddoc",
      "language": "javascript",
      "views": {
        "foo_count": {
          "map": "function (doc) { emit(doc.foo, 1); }",
          "reduce": "_sum",
          "views": {
            "sorted_by_count": {
              "map": "function (doc) { emit(doc.value, null); }"
            }
          }
        }
      }
    }

The parent view is reduced with group_level=exact, and each reduced row is sent to the nested view's map functions as a document that looks like so:

    {
      "_id": json_encode(parent_view_key),
      "key": parent_view_key,
      "value": reduced_value
    }

Nested views would be queried using the standard view API, like so:

    GET /db/_design/ddoc/_view/foo_count/_view/sorted_by_count?query=args

Using include_docs=true would return the reduced parent row as the doc.

== Comparison to Cloudant's "dbcopy"

The trade-offs between this approach and chaining to a different design doc or a different database are similar to the trade-offs of storing all of your views in one design doc versus one view per design doc:

* simplicity of one design doc vs. multiple
* consistency within a design doc's index vs. eventual consistency
* high vs. low latency between doc insertion and view availability
* the entire index is invalidated when one function changes
* some views may need to be compacted more frequently than others

I have some application code that could take advantage of each approach, and ultimately I'd like to see both styles supported natively by Apache CouchDB.

== Implementation notes

New record fields: #mrview{path=[], nested_id_btree=nil}.

Each nested view creates a new View#mrview object which is stored directly on the #mrst.views list. The view state saved in the #mrheader will include a second btree for parent views. View#mrview.nested_id_btree keeps track of ToRemove entries for incremental updates (analogous to #mrst.id_btree).

View#mrview.path represents the "path" of parent views from the root as a list. It will be used (by couch_mrview_util:get_view/4) for view lookup; it also influences the sort order of the views in #mrst.views, simplifies include_docs, detects "safe" top-level views during the index updater's initial mapper pass, and ensures uniqueness when comparing one view to another (e.g. views could share their map definitions but have different data sources). If two different paths share identical function code, they will still result in two view indexes.

In origin/master, couch_mrview_updater:update/2 starts up two linked processes from the parent index updater process: one for mapping, one for writing (reductions stored in the btree are handled by the couch_btree module). I've created a third process to handle the nested mapping.

The writer process traverses the view tree depth-first and should send the reduced rows (group_level=exact) to the nested mapper process; a rough sketch of the row-to-doc formatting is at the end of this section. The writer waits for the nested results before continuing, because otherwise the #mrst would claim a higher update_seq than the nested views have actually indexed, and the index file would be temporarily inconsistent.

My (incomplete) code is at http://github.com/nevans/couchdb (in master). I did a lot of refactoring just so I could understand the CouchDB code better and to test my knowledge of Erlang. Most of these changes were splitting up large functions, renaming variables, and turning recursive functions into higher-order functions passed to lists:foldl (I have the same aversion to explicit looping in Ruby when Enumerable#inject can do the job). I noticed that master doesn't contain a single merge commit(!), so obviously this would need to be cleaned up a lot (cosmetic refactorings could be backed out or placed in separate commits) if there is ever any interest in merging it.
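To make the writer-to-nested-mapper hand-off above a bit more concrete, here is roughly how I picture wrapping one group_level=exact row as the document the nested map functions see. This is only a sketch, not what's in my branch; the module and function names are made up, and the encoder is left as a parameter (couch_util:json_encode/1 is what I'd probably reach for, but any stable JSON encoding of the parent key would do for the synthetic _id):

    %% Sketch only: wrap one reduced parent row as the EJSON "doc" handed
    %% to the nested map functions. JsonEncode is whatever encoder is
    %% handy; it only has to produce a stable string for the synthetic _id.
    -module(nested_doc_sketch).
    -export([reduced_row_to_doc/3]).

    reduced_row_to_doc(JsonEncode, ParentKey, ReducedValue) ->
        Id = iolist_to_binary(JsonEncode(ParentKey)),
        {[{<<"_id">>, Id},
          {<<"key">>, ParentKey},
          {<<"value">>, ReducedValue}]}.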
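On the sort-order point above: if the paths are stored root-first, plain Erlang term ordering already puts each parent directly before its nested children (a list prefix compares less than any extension of it), which is the depth-first order the writer wants. A quick shell example with made-up view names:

    %% Default term ordering sorts a parent's path immediately before the
    %% paths of its nested children:
    Paths = [[<<"foo_count">>, <<"sorted_by_count">>],
             [<<"bar">>],
             [<<"foo_count">>]].
    lists:sort(Paths).
    %% => [[<<"bar">>],
    %%     [<<"foo_count">>],
    %%     [<<"foo_count">>,<<"sorted_by_count">>]]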
== Implementation progress

First-pass "done" so far:

* Tests!
  * First, do no harm: still passing the original tests.
  * A stupid basic test that creates a single simple nested view and tries to update it.
* Creation
  * Parse the design docs to create the nested views.
* Updates
  * Traversal of nested views by the writer process.
  * Visiting each nested view by the nested mapper process
    * ...although it's buggy and crashes when I send in sample data.
* Retrieval
  * Nothing yet...

Still incomplete in my branch (with some questions):

* Tests!
  * More complicated nested views.
  * Queried nested views.
  * Queried with include_docs.
* Creation
  * Add the nested views to the mrheader; read them in and write them out.
    (couch_mrview_util:init_state/4, couch_mrview_util:open_view/5)
  * Initialize the nested_id_btree (with appropriate Reduce, Less, etc.).
    (couch_mrview_util:open_view/5)
  * Consider nested view defs in computing the index file's MD5.
    (couch_mrview_util:ddoc_to_mrst/2)
    Is this automatically accomplished by updating #mrst.views and #mrview?
* Updates
  * Update sample data without crashing!
  * Pull in the group/reduced rows and format them for the nested mapper.
    * This keeps stumping me.
* Retrieval
  * Query nested views at all.
  * Query nested views with include_docs.
* Etc. (does this matter?)
  * couch_mrview_util:calculate_data_size/2
  * ???

== Questions

As an Erlang newbie hacking on unfamiliar source code, I hope I haven't made extra work for myself by going with this approach vs. a view-changes-feed, multi-db, Cloudant-style approach... Is there anything incredibly stupid about this approach? :-) Any thoughts on the intended API?

I'm having a hard time querying the view in its current state from the internals of couch. I've got the partially updated View state object with its updated btrees, which is *not* the same View and Btree that would come back if I were to use couch_mrview_util:get_view(Db, DDoc, Vname, Args). But I haven't been able to successfully hack together a version of couch_mrview:query_view that will do what I want.

Can incremental updates for the most common use case (alternate sort indexes) be handled by something simpler than a completely generic iterative map reduce?

I've branched from origin/master. If I want to put this into production for myself as soon as it's "done", should I use the 1.3 branch instead?

I've probably created some performance regressions in the nested view traversal. Any thoughts there? How should I test that (other than by timing my own datasets)? Right now, getting this done is more important for us than getting it done optimally.

Is there a "view index version" number that should be incremented somewhere, so the new mrheader format doesn't break old views? Is there a good way to update a view file in place (e.g. see an old header and rewrite a new header)?

Since the nested map and nested write queues never have more than one entry at a time, should I even bother with the overhead of couch_work_queue? Should I just use the plain old Erlang "send message, receive response" pattern instead (a sketch of what I mean is below, after the questions)? Since it is done synchronously, should it even use a separate process? Then again, to avoid CPU bottlenecks, it (like "merge_results") should probably also be in its own process?

On a side note, it seems wasteful to constantly be passing source code to the view servers. Have you considered using a pattern like Redis's EVALSHA scripting? First check (using checksums of the functions) whether the functions have already been loaded on that view server; if not, load them. All other references to the functions then use the checksum instead. Perhaps SpiderMonkey is smart enough to keep an internal function cache, but it still seems like non-trivial I/O bandwidth could be wasted on large function bodies.
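To make the couch_work_queue question above concrete, the "plain old send/receive" version I have in mind is roughly this (all names made up; the real messages would obviously carry more than just the docs):

    %% Sketch of the plain send/receive alternative to couch_work_queue.
    %% The writer calls map_docs/2 and blocks until the nested mapper
    %% replies, which is the synchronous behaviour I need anyway.
    -module(nested_mapper_sketch).
    -export([start_link/1, map_docs/2]).

    start_link(MapFun) ->
        spawn_link(fun() -> loop(MapFun) end).

    %% Called from the writer process; returns the mapped results.
    map_docs(Mapper, Docs) ->
        Ref = make_ref(),
        Mapper ! {map_nested, self(), Ref, Docs},
        receive
            {nested_results, Ref, Results} -> Results
        end.

    loop(MapFun) ->
        receive
            {map_nested, From, Ref, Docs} ->
                From ! {nested_results, Ref, [MapFun(Doc) || Doc <- Docs]},
                loop(MapFun);
            stop ->
                ok
        end.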
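And a sketch of the EVALSHA-style idea (again, module and table names made up, and erlang:md5/1 is just a stand-in for whatever digest the view server protocol would actually use):

    %% Sketch only: ship the full function source to a view server the
    %% first time its checksum is seen; afterwards refer to it by checksum.
    -module(fun_cache_sketch).
    -export([new/0, ensure_loaded/3]).

    new() ->
        ets:new(view_fun_cache, [set, public]).

    %% LoadFun(Checksum, Source) is whatever actually sends the source to
    %% the view server; returns the checksum callers should use from then on.
    ensure_loaded(Cache, Source, LoadFun) ->
        Checksum = erlang:md5(Source),
        case ets:member(Cache, Checksum) of
            true ->
                Checksum;
            false ->
                ok = LoadFun(Checksum, Source),
                true = ets:insert(Cache, {Checksum, loaded}),
                Checksum
        end.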
Since you've read this far, a reminder: my bosses are open to a short-term contract with an Erlang-hacking CouchDB committer to get this into a safely usable state ASAP.

Thanks for any help! I'll also be available in #couchdb and #couchdb-dev.

Nick Evans