couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Will Holley <willhol...@gmail.com>
Subject Re: [DISCUSS] Map Indexes
Date Mon, 15 Apr 2019 15:15:58 GMT
Thanks Garren,

As usual, a few questions :)

1. The data model suggests the idea of view groups gets carried over to
fdb. Are there API / behaviour reasons to keep them? Would an index update
transaction scope to a view group rather than a single view?
2. Regarding emitting doc as the value in a view function, this is so
common that I wonder if it's worth handling as a special case. It sounds
like there wouldn't be a solution for customers who use this technique to
ensure they can retrieve the version of the document that is consistent
with the emitted key?
3. When you say "Emitted keys will not be able to exceed 10 KB", do you
mean any single emitted key cannot exceed 10KB? The "id index" proposal
suggests there would also be a 100KB limit on the combined emitted key
length.

Cheers,

Will

On Mon, 15 Apr 2019 at 15:25, Garren Smith <garren@apache.org> wrote:

> Hi Everyone,
>
> I want to start a discussion around creating map/reduce view indexes. One
> way to get views indexes to work with FoundationDB is to break up a view
> index into indexes for the map functions and indexes for the reduce
> functions. Along those lines, I’m going to break the discussions into two,
> this discussion around map functions and indexes and then a another one on
> reduce functions and the indexes that go with those.
>
> ## Data model
> For a map function, we need to store the emitted keys and the emitted
> values:
>
> {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?MAP, <keys>,
> <_id>} -> <emitted_value>
>
> To briefly explain what the above means, it creates a views subspace in a
> database subspace, then every view defined on a design doc is grouped via
> the design doc’s view signature. The view_id is the name of the view in the
> design doc - we can look at ways to make that smaller to save some key
> space. The ?MAP groups the key/value into the view’s map index subspace,
> then we have the keys that were emitted for the map function and finally
> the _id field of the document used to create the keys for this row.
>
> ## Emitted Value
> The value stored for the row is the emitted value from the map function.
> Because we have a limitation on the size of the value field one caveat
> around this design is that a user will run into issues if they emit a
> document that exceeds 100KB. In CouchDB we don’t recommend users emitting
> the doc, but there are some nice speed optimisations you get by emitting
> the document as the value. With CouchDB on FDB that performance
> optimisation won’t be required and so we will have to actively discourage
> users from doing that.
>
> Just to note, a user would experience the same issue if they emit a value
> exceeding 100KB.
>
> ## Key ordering
> There are some changes to how we will manage keys emitted from a map
> function. For strings we will need to generate a ICU sort string upfront
> instead of using the ICU comparison. To maintain the way CouchDB currently
> does view collation [1], we need to prepend a type value to each key so
> that we get the correct sort order of null < boolean < numbers < strings <
> arrays < objects. CouchDB currently allows duplicate keys to be emitted for
> an index, to allow for that a counter value will be added to the end of the
> keys.
>
> ## Index Key Management
> For every document that needs to be processed for an index, we have to run
> the document through the javascript process to get the emitted keys and
> values. This means that it won’t be possible to update a map/reduce index
> in the same transaction that a document is updated. To account for this, we
> will need to keep an `id index` similar to the `id tree` we current keep.
> This index will hold the document id as the key and the value would be the
> keys that were emitted. We would then use this information to know which
> fields need to be updated or removed from the index when a document is
> changed.  A data model for this would be:
>
> {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?ID_INDEX, <_id>,
> <view_id>} -> [emitted keys]
>
> ## Updating an index
> To help in knowing which documents have changed since a view was last
> updated, we will need to keep the latest update sequence. This will change
> from the really long string we currently have, to using the sequence value
> defined in the _changes RFC [2].
>
> Based on all of that, a slightly expanded data model for map functions
> inside a database subspace would look like:
>
> * views
>     * <signature>
>         * update_seq
>         * idtree
>             * (<_id>, <viewid>) -> [keys]
>         * views
>             * <viewid>
>                 * map
>                     * (<key>, <_id>) -> <value>
>
> ## Size limits
> There are some size limits that are worth listing and keeping in mind.
>
> * Emitted keys will not be able to exceed 10 KB
> * Values cannot exceed 100 KB
> * Following from Alex’s email on how transaction sizes are calculated [3],
> there could be rare cases where the number of key-value pairs emitted for a
> map function could lead to a transaction either exceeding 10 MB which isn’t
> allowed or exceeding 5 MB which impacts the performance of the cluster. We
> will have to detect for those situations and split the transaction into
> smaller transactions
>
> What do you think of that? Any questions or thoughts on this? Once again a
> big acknowledgment to Adam who did the initial investigation and design
> ideas around this.
>
> Cheers
> Garren
>
> [1]
>
> http://docs.couchdb.org/en/stable/ddocs/views/collation.html#collation-specification
> [2] https://github.com/apache/couchdb-documentation/pull/401
> [3]
>
> https://lists.apache.org/thread.html/4976e0b7e3df89c5d64f37b5299b04c2ed01088f357be8aceaeedec1@%3Cdev.couchdb.apache.org%3E
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message