couchdb-user mailing list archives

From "nicholas a. evans" <n...@ekenosen.net>
Subject Options for Iterative Map Reduce
Date Wed, 12 Dec 2012 17:50:52 GMT
I've got some views that simply must use iterative map reduce.  The
greatest need is to sort based on the value of the first reduction.  I'm
looking over my options, and I'm going to list them here.  I'm hoping
someone can tell me that I've missed an option, or recommend a
modification to one of the approaches below.

0) Just do the second map reduce in client code...
1) Use Cloudant's chained map reduce feature
2) Use couch-incarnate
3) Skip CouchDB and learn Hadoop or Storm...
4) Learn me some Erlang for great patches...
5) Roll my own by copying reduced view results into another DB...

(tl;dr: options 0-4 don't work for me; I'll detail several approaches to
option 5, but none of them lets me relax.)


Option 0 is what we're doing now.  For our small users, it's great; for
medium users, it's okay.  But for some of our more complicated views on our
larger DBs, "GET /user_123/_design/foo/_view/bar?reduce=true&group=true"
downloads several MB of data and takes 2 minutes or more, which just
doesn't work.  We can play some neat tricks with ETags and caching
strategies that mimic "stale=update_after" (i.e. immediately grab the
latest from the cache and fire off a background process to update the
cache).  We can then also cache the results of the second mapping.  But
this is annoyingly elaborate and shares the same latency downside as the
naive approach (mentioned later).
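
That caching trick looks roughly like this (a minimal sketch; it assumes a
client with a global fetch, e.g. Node 18+, and the cache shape and names
are illustrative, not our real code):

    // Serve the cached copy immediately; refresh in the background.
    const cache = new Map(); // url -> { etag, body }

    async function cachedView(url) {
      const hit = cache.get(url);
      const refreshing = refresh(url, hit);
      if (hit) {
        refreshing.catch(function () {}); // background update; log errors in real code
        return hit.body;                  // stale copy, right away
      }
      return refreshing;                  // the first request has to block
    }

    async function refresh(url, hit) {
      const headers = hit ? { 'If-None-Match': hit.etag } : {};
      const res = await fetch(url, { headers });
      if (res.status === 304) return hit.body; // view unchanged; keep cache
      const body = await res.json();
      cache.set(url, { etag: res.headers.get('ETag'), body: body });
      return body;
    }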

As a tangential point, I do wonder why it takes 2 minutes to download
several MB of data, even when run via curl to localhost.  It's not
bandwidth limited, so why does it take so long to construct and send the
JSON?

For reasons I won't bore you with, Cloudant or couch-incarnate are not
(currently) options.  But both approaches seem basically reasonable...
Anyway, it's just a real shame that Apache CouchDB can't do this out of the
box.

My one boss is of the opinion that an RDBMS can handle this and everything
else, so why are we wasting time with this NoSQL nonsense?  (Ignoring that
CouchDB made most other bits easier and faster, excepting this one glaring
issue.)  And my other boss is of the opinion that we should switch to
using Storm (which I don't have any experience with, so I can't properly
assess what it would make easier or harder).  Rewriting our current code
base to use PostgreSQL, Hadoop, or Storm sounds like a lot of work for me,
when all I want from CouchDB is efficient sorting based on the map/reduce
values.


It seems to me that it should be relatively simple to modify CouchDB to
chain views within a single design document.  You'd need to:
 1) choose a syntax to say that one view depends on another, e.g.
"sorted_by_foo": {"depends": "bar_view", "map": "function(doc) { ... }"},
or maybe even a simpler "sorted_by_foo": { "depends": "bar_view",
"sort_by": "function (k,v) { return [v.foo, k]; }"} (a fuller design-doc
sketch follows this list).
 2) construct a DAG to run the views in the correct order (or detect cycles
and throw an error).
 3) collect view changes in order to run the chained views incrementally.
 4) feed the changes from one view into the map (or sort) function for the
dependent view(s).
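
To make (1) concrete, a design document using this hypothetical syntax
might look like the following.  To be clear: "depends" and "sort_by" are
made up; CouchDB doesn't support them today, and the map/reduce bodies are
just stand-ins.

    {
      "_id": "_design/foo",
      "views": {
        "bar_view": {
          "map": "function (doc) { emit(doc.group, doc.amount); }",
          "reduce": "_sum"
        },
        "sorted_by_foo": {
          "depends": "bar_view",
          "sort_by": "function (k, v) { return [v, k]; }"
        }
      }
    }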

It *seems* simple to me... but I know *nothing* about the internals of
CouchDB or Erlang, so I'm probably oversimplifying or missing some steps.
 :)  Of course, if I had a view changes feed, I could put that to work for
other efficient chaining mechanisms, too.

Thinking of more elaborate changes to couch, since sorting by value is
probably the most common chained map/reduce use case, it would really be
nice to simply have multiple sorts on a single map/reduce function, with no
need to duplicate view data or go to the extra work of creating chained
views.  e.g. "foo_view": { "map": "...", "reduce": "...", "sort_by":
{"bar": "function (k,v) { ... }", "baz": "function (k,v) { ... }"}}; which
could then be queried by an API like
"/db/_design/blah/_view/foo/_sort_by/bar?reduced=true&group=true".  From a
user's point of view, this would be most relaxing... but I imagine that
this would require far more changes to couch internals and the view file
structure than my previous proposal.


So it looks like I'm stuck with rolling my own with the tools that CouchDB
1.2 provides.  Here is what I've come up with (none of which leaves me
feeling particularly relaxed):


Simple (Naive) approach:
  SourceDB => ChainedDB

  1) GET all keys from reduced/grouped view in SourceDB.
  2) GET all docs in ChainedDB.
  3) Bulk update all docs in ChainedDB.
  4) Place secondary views on ChainedDB.

This feels conceptually similar to what Cloudant is doing (with "dbcopy").
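
In code, the naive approach works out to something like this (a sketch
only: it assumes Node 18+ fetch, skips error handling, and all the DB and
view names are placeholders):

    // Naive full resync: SourceDB => ChainedDB.
    async function naiveSync(sourceUrl, chainedUrl) {
      // 1) GET all keys from the reduced/grouped view in SourceDB.
      const view = await (await fetch(
        sourceUrl + '/_design/foo/_view/bar?reduce=true&group=true')).json();

      // 2) GET all docs in ChainedDB, so we know each _rev and what to
      // delete.  Skip design docs so we don't clobber the secondary views.
      const all = await (await fetch(
        chainedUrl + '/_all_docs?include_docs=true')).json();
      const existing = new Map(all.rows
        .filter(function (r) { return r.id.indexOf('_design/') !== 0; })
        .map(function (r) { return [r.id, r.doc]; }));

      // 3) Bulk update: one ChainedDB doc per view row; delete leftovers.
      const docs = view.rows.map(function (r) {
        const id = JSON.stringify(r.key);
        const doc = { _id: id, key: r.key, value: r.value };
        if (existing.has(id)) doc._rev = existing.get(id)._rev;
        existing.delete(id);
        return doc;
      });
      existing.forEach(function (doc, id) {
        docs.push({ _id: id, _rev: doc._rev, _deleted: true });
      });
      await fetch(chainedUrl + '/_bulk_docs', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ docs: docs }),
      });
      // 4) Secondary views live in a ChainedDB design doc as usual.
    }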

Downside: high latency.  It's okay for the *initial* load to take a
while, but 2 minutes or more is *way* too long (and wasting far too much
CPU/IO) to wait for minor changes in SourceDB to trickle through to
ChainedDB.  Most of the time (for my use case), a single row change in
SourceDB will result in only a single row change in ChainedDB, so this can
and should be done more intelligently.

Also, steps 2+3 exist so that we can delete missing rows and update only
the rows that have actually changed, with the correct _rev.  It might be
sped up a little by creating a list function that returns _bulk_docs
format, posting that with all_or_nothing, and running a separate
delete+conflict-cleanup phase (sketched below).  But
that doesn't address the core issue of the slow get/update *everything*
approach.
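
Such a list function might look like this (hypothetical and untested; note
that it can't know the current _revs, which is exactly why all_or_nothing
and the conflict-cleanup phase are needed):

    // _list function: reshape a reduced/grouped view into a _bulk_docs
    // payload, e.g. GET .../_list/to_bulk/bar?reduce=true&group=true
    // and POST the result straight to ChainedDB's /_bulk_docs.
    function (head, req) {
      start({ headers: { 'Content-Type': 'application/json' } });
      var row, docs = [];
      while ((row = getRow())) {
        docs.push({
          _id: JSON.stringify(row.key),
          key: row.key,
          value: row.value
        });
      }
      send(JSON.stringify({ all_or_nothing: true, docs: docs }));
    }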

Is there a faster way of doing this approach, that preserves its simplicity?


Incremental changes approach (the couch-incarnate way, as I understand it):
  SourceDB => ChainedMapDB => ChainedReduceDB

  1) GET changes to SourceDB from changes feed.
  2) In my own process (not CouchDB), run the map function on the changed
docs.  Include SourceDB _id as metadata in new rows.
  3) Use a special view in ChainedMapDB to find rows affected by the
changed source docs.
  4) Bulk create/update/delete the affected rows in ChainedMapDB (steps
1-4 are sketched in code after this list).
  5) GET changes to ChainedMapDB from changes feed.
  6) It will be obvious which rows need to change in ChainedReduceDB.
  7) Place the reduce function onto ChainedMapDB (with a trivial identity
map function).
  8) GET changed rows from reduced/grouped view in ChainedMapDB.
  9) Bulk create/update/delete changed ChainedReduceDB.
  10) Place secondary views on ChainedReduceDB.
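
A sketch of the map stage (steps 1-4), under some assumptions: mapFn is
the original map function adapted to take emit as an argument, ChainedMapDB
has a "by_source" view keyed on each row's source_id, and the client has
Node 18+ fetch.

    async function syncMapStage(sourceUrl, mapDbUrl, since, mapFn) {
      // 1) Pull changed docs from SourceDB's changes feed.
      const changes = await (await fetch(
        sourceUrl + '/_changes?since=' + since + '&include_docs=true')).json();

      const docs = [];
      for (const ch of changes.results) {
        // 3) Find the rows we previously wrote for this source doc.
        const viewKey = encodeURIComponent(JSON.stringify(ch.id));
        const old = await (await fetch(
          mapDbUrl + '/_design/meta/_view/by_source?key=' + viewKey +
          '&include_docs=true')).json();
        for (const row of old.rows) {
          docs.push({ _id: row.doc._id, _rev: row.doc._rev, _deleted: true });
        }
        // 2) Re-run the map function on the new version of the doc.
        if (!ch.deleted) {
          mapFn(ch.doc, function (key, value) {
            docs.push({ source_id: ch.id, key: key, value: value });
          });
        }
      }
      // 4) Bulk create/update/delete the affected rows in ChainedMapDB.
      await fetch(mapDbUrl + '/_bulk_docs', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ docs: docs }),
      });
      return changes.last_seq; // checkpoint for the next run
    }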

Major downsides:
    * Complicated code to manage syncing source=>map=>reduce DBs.
    * Managing the map function external to Couch.
    * Three databases for one map/reduce/map chain.
    * This is *not* relaxing.  It does make me want to learn Hadoop or
Storm.

Is there a way to simplify this?


Modified incremental approach:
  SourceDB => ChainedDB

Like the couch-incarnate approach, but the map/reduce functions live in
SourceDB, modified to include metadata.

  0) The original map function is modified so that 'emit(key, value);'
becomes 'emit(["data", key], value); emit(["metadata", doc._id], key);',
and the original reduce function is modified to skip metadata rows and use
the original key for data rows (shown concretely after this list).
  1) GET changes to SourceDB.
  2) Query the view using ["metadata", changed.id] keys; this tells you
which rows in ChainedDB need to be updated and which rows to grab from the
view.
  3) Query the reduced/grouped view using the ["data", ...] keys from the
previous step to get the new values.
  4) Bulk create/update/delete ChainedDB.
  5) Place secondary views on ChainedDB.

Downsides:
  * need to modify original map function (although a relatively trivial
substitution).
  * need to modify original reduce function (although relatively trivial
insertion).
  * As far as I can tell, no one has used this approach yet.  Untested.

If no one has any better ideas, this is what I'll wind up trying.


It seems to me that a view changes feed would make this much simpler.  You
could easily get the simplicity of the naive approach with the speed of the
incremental approaches.  I thought I read about a patch for view changes
floating around, but some quick searching didn't turn it up.  Does it apply
cleanly to 1.2 (or 1.3)?  I'll consider using it... but is there a reason
it hasn't already been accepted into master?


Thanks for reading this far, and thanks for any input you might have.
-- 
Nick Evans
