couchdb-dev mailing list archives

From "nicholas a. evans" <n...@ekenosen.net>
Subject Re: Erlang vs JavaScript
Date Mon, 19 Aug 2013 23:26:11 GMT
It seems like there are several simple "internalizing" speedups,
hinted at by Alexander's suggestion, that we could pursue even before
tackling the view server protocol or the couchjs view server:

On Fri, Aug 16, 2013 at 3:58 PM, Alexander Shorin <kxepal@gmail.com> wrote:
> Idea: move document metadata into separate object.
...
> Case 2: Large docs. Profit in case when you have set right fields into
> metadata (like doc type, authorship, tags etc.) and filter first by
> this metadata - you have minimal memory footprint, you have less CPU
> load, rule "fast accept - fast reject" works perfectly.

For the simple case of filtering which fields are passed to the map
fn, you don't need full-blown chained views; you only need a simple
way to define field filters (describing which fields are the relevant
"metadata" fields).
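To make that concrete, here's a rough sketch (the function name and
the choice to always keep _id/_rev are my own illustrative
assumptions, not an existing CouchDB API) of stripping a doc down to
its declared metadata fields before the map fn ever sees it:

```javascript
// Hypothetical sketch: project a document down to its declared
// "metadata" fields before handing it to the map function.
function projectFields(doc, fields) {
  var out = { _id: doc._id, _rev: doc._rev }; // identity fields always survive
  fields.forEach(function (f) {
    if (f in doc) out[f] = doc[f];
  });
  return out;
}
```

With "fields": ["type"], a large doc would shrink to just {_id, _rev,
type} before ever crossing the view server boundary.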

> Side effect: it's possible to autoindex metadata on fly on document
> update without asking user to write (meta/by_type, meta/by_author,
> meta/by_update_time etc. views). Sure, as much metadata you have as
> large base index will be. In 80% cases it will be no more than 4KB.

Just as the internals of couch already optimize away the case where
multiple views in the same design doc share the same map function (but
different reduces), we should also be able to optimize away the case
where multiple views share the same fields filter.

> Resume: probably, I'd just described chained views feature with
> autoindexing by certain fields (:

One lesson I learned when I looked into implementing chained
map/reduce views is that they will need to be in different design_docs
from the parent views, in order to play nicely with BigCouch.  Keeping
them in the same design_doc just doesn't work with parallel view
builds (at least, not without breaking normal design_doc
considerations).  So although I really like the simplicity of the
"keep chained views in one design doc" approach, it's probably a
dead-end.

> Removing autoindexing feature and we could make views building process
> much more faster if we make right views chain which will use set
> algebra operations to calculate target doc ids to pass to final view:
> reduce docs before map results:
>
> {
> "views": {
>     "posts": {"map": "...", "reduce": "..."},
>     "chain": [
>      ["by_type", {"key": "post"}],
>      ["hidden", {"key": false}],
>      ["by_domain", {"keys": ["public", "wiki"]}]
>   ]
>  }
> }

I was inspired by your view syntax and thought I'd put forward my own
similar proposal:

{
  "_id": "plain_old_views_for_comparison",
  "views": {
    "single_emit": {
      "map": "function(doc) { if (!doc.foo) { emit([doc.bar, doc.baz], doc.quux); } }",
      "reduce": "_count"
    },
    "multiple_emits": {
      "map": "function(doc) { if (!doc.foo) { emit([0, doc.bar], doc.quux); emit(['baz', doc.baz], doc.quux); } }",
      "reduce": "_count"
    }
  }
}

{
  "_id": "internalized",
  "options": {
    "filter": "!foo",
    "fields": ["bar", "baz", "quux"]
  },
  "views": {
    "single_emit_1": {
      "map": "function(doc) { emit([doc.bar, doc.baz], doc.quux); }",
      "reduce": "_count"
    },
    "single_emit_2": {
      "map": { "key": ["bar", "baz"], "value": "quux" },
      "reduce": "_count"
    },
    "multiple_emits": {
      "map": { "emits": [[[0, "bar"], "quux"], [["'baz'", "baz"], "quux"]] },
      "reduce": "_count"
    }
  }
}

All of the views above should behave the same way.  The view options
would support "filter" as a guard clause and "fields" to strip out all
but the relevant metadata.  These should be defined at the design
document level to simplify working with the current view server
protocol.  And the view "map" could optionally be an object describing
the emitted keys and values instead of a function string.
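As a sketch of how the object form of "map" might be interpreted
(hypothetical helper names; the single-quote convention for string
literals is my assumption, taken from the "'baz'" entry in the example
above):

```javascript
// Interpret a declarative map spec like
//   { "key": ["bar", "baz"], "value": "quux" }
// or
//   { "emits": [[[0, "bar"], "quux"], [["'baz'", "baz"], "quux"]] }.
// Bare names are field lookups; quoted entries are string literals.
function resolveTerm(doc, term) {
  if (typeof term === 'number') return term;            // literal number
  if (term.charAt(0) === "'") return term.slice(1, -1); // string literal
  return doc[term];                                     // field lookup
}

function runDeclarativeMap(doc, mapSpec) {
  // "key"/"value" is shorthand for a single emit
  var emits = mapSpec.emits || [[mapSpec.key, mapSpec.value]];
  return emits.map(function (pair) {
    var key = pair[0].map(function (t) { return resolveTerm(doc, t); });
    return [key, resolveTerm(doc, pair[1])];
  });
}
```

The win is that none of this needs a round trip to couchjs: it could
run entirely inside the Erlang indexer.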

The filter string should be simple but powerful: I'd suggest
supporting !, &&, ||, (), "foo.bar.baz", >, <, >=, <=, ==, !=,
numbers, and strings (for "type == 'foo'").  But even if all it
supported was "foo" and "!foo", it would still be useful.  In some
cases, this would prevent most docs from ever needing to be evaluated
by the view server.  The "fields" array might also support nested
fields like "foo.bar.baz".  The "filter" and the internal map ("key",
"value", "emits") should support the same values that "fields"
supports, plus numbers and strings; or they could support the full
"filter" syntax to allow things like "key": ["!!deleted_at",
"deleted_at"].  The "filter" and internal map would be able to use all
of the fields, not just the ones declared in the options.
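Even the degenerate "foo"/"!foo" grammar is cheap to evaluate without
a view server round trip.  A minimal sketch (function names are mine;
dotted-path support follows the "foo.bar.baz" suggestion above):

```javascript
// Look up a possibly-dotted path like "foo.bar.baz" in a document.
function lookupPath(doc, path) {
  return path.split('.').reduce(function (obj, key) {
    return obj == null ? undefined : obj[key];
  }, doc);
}

// Evaluate the simplest filter grammar: "foo" (truthy) and "!foo" (falsy).
function passesFilter(doc, filter) {
  if (filter.charAt(0) === '!') return !lookupPath(doc, filter.slice(1));
  return !!lookupPath(doc, filter);
}
```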

Another odd case where I've personally noticed indexing speed get
immensely bogged down is when the reduce function merges the map
objects together.  I've seen views with this problem go up to 5GB
during initial load and compact back down to 20MB.  I've documented
this problem and my workaround here:
https://gist.github.com/nevans/5512593.  The hideous reduce pattern
in that gist has given me 2-5x faster view builds (small DBs
infinitesimally slower, huge speedup for big DBs).  But it would be
*much* better to simply add a "minimum_reduced_group_level" option to
the view, and let Erlang handle it without unnecessary view server
round trips and hideously complicated reduce functions.  Any
group_level below the minimum_reduced_group_level would simply return
"null" for all of the values.
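In other words (a hedged sketch; the option name and row shape are
from my proposal above, not anything CouchDB does today):

```javascript
// For a grouped query below the minimum reduced group level, rows keep
// their keys but the values come back as null, so the engine never has
// to rereduce (or store) the big merged objects at shallow levels.
function applyMinimumReducedGroupLevel(rows, groupLevel, minLevel) {
  if (groupLevel >= minLevel) return rows;
  return rows.map(function (row) {
    return { key: row.key, value: null };
  });
}
```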

This isn't a trivial proposal, but it can be implemented completely
independently of any view server protocol or couchjs changes.  And
even a simplified version could still yield major speedup for some of
the most common map patterns, just as "_sum" and "_count" speed up the
most common reduce functions.  The individual pieces can also land
separately: if I were to work on this myself (probably not going to
happen in the next month or two), I'd do
"minimum_reduced_group_level" first and "filter" second, since I think
that's where *my* biggest bang for the buck would be.  But other
people's datasets (e.g. ones with large docs) might get the biggest
improvement from "fields".  And if you have lots of simple map
functions, you might get the biggest speedup from the internal map
"key", "value", "emits".

What do you think?  Ugly and untenable?  Or a shot in the right direction?


Also, I know that Jason already yielded on the O(N) argument, but I
got here late and wanted to add my $0.02:  Obviously anything better
than O(N) is impossible when you need to map N documents.  Changing to
O(N/Q) (where Q=parallelism of view indexing; e.g. throw hardware at
it) is still essentially O(N), but it's very useful and something that
BigCouch does nicely.  A 9x speedup might be the difference between a
rollout taking 90 hours (barely finishes over the weekend) and 10
hours (you can do it overnight during the week).  The longer the view
rollout period, the slower and more cautious the
development/deployment cycle becomes.  More importantly, it might be
the difference between loading a large user in 9 hours vs 60 minutes,
which will feel like a qualitative improvement to that user and is
especially important when that user is e.g. Walt Mossberg and load
time is one of two nitpicks he has in his review.  Or when you have a
hundred similarly jumbo-sized users sign up the next day.  Sorry for
piling on after the argument is over.  :)

-- 
Nick
