couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adam Kocoloski <>
Subject Re: [DISCUSS] Implementing Mango Indexes for FoundationDB
Date Fri, 12 Apr 2019 03:11:08 GMT
Thanks Alex!

I didn’t mention it on the mailing list, but in the RFC for the “exploded KV” document
storage I did propose a 1MB document size limit. We’re frankly overdue for a restriction
of that nature.

Continuing on that theme, we also do not currently have an enforced limit on the number of
indexes that could be created in a single database. I remember seeing a database a few years
ago that had over 250 active indexes. That ... wasn’t pretty. I would happily support introducing
a limit there, though I don’t yet have a good sense of where to place that limit to minimize
the disruption for existing users while staying safely within FDB limits.

Great tip on the use of addWriteConflictRange, that hadn’t occurred to me at all.


> On Apr 11, 2019, at 9:52 PM, Alex Miller <> wrote:
> There was some discussion on the FDB forums previously about how to do index backfill,
> If you hadn’t seen it before, but it’s already reasonably similar to what you’ve
> There's also two related topics that I'd like to provide some context and flavor on:
> 1. The 10MB transaction size limit.
> 2. The performance/latency/throughput implications of writing to multiple
>    disjoint keys versus multiple nearby keys.
> ### Transaction Size Limit
> There was some previous concern surrounding the FDB 10MB transaction limit,
> what it means for CouchDB, and how to communicate the effective maximum
> document size to users.  Index-on-write will impact this as well, as index
> writes will be counted against the 10MB limit.
> As I don't believe I've seen a precise definition yet of the maximum size of a
> document allowed in CouchDB-on-FDB, let us assume it's defined as "whatever FDB
> will accept".  It would then be possible that a document can be inserted that
> just barely fits within the size limits, and then indexes can be defined that
> apply to the document.  One is now left with a document that exists, and is
> valid in the database, but couldn't be read and re-inserted into the database,
> because it would exceed the size limit.  My largest concern would be that this
> would have bad implications on CouchDB replication.
> Although the transaction limit is 10MB, it's generally advisable to aim to stay
> under 1MB for the cluster to perform well.  I've generally seen latency start
> to get strange for all clients when one client starts exceeding ~5MB.  If one
> requires the total size of a document to stay less than 1MB, then this
> comfortably leaves room for indexes in the commit request, and the above case
> would be relatively unlikely.
> Exactly how unlikely involves dissecting exactly how the transaction size limit
> is calculated.
> The 10MB limit applies to the size of the commit request, and not to the sum of
> the key-value pair sizes, as one might expect.  The largest difference between
> the two meanings is that read and write conflict ranges count against the 10MB
> limit.  The rough formula for calculating commit size is:
> $$ 2*len(keys read) + 3*len(keys written) + len(values written) $$
> Each `get()` adds a read conflict range of the read key and the sequentially
> next key.  Each `set()` adds a write conflict range the same way, in addition
> to having to send the actual key and value.  This is the source of the `2*` and
> `3*` multipliers.
> `getRange()` adds a read conflict range that exactly reflects the requested
> range.  However, there is no `setRange()` to automatically optimize the write
> conflict ranges added for inserting sequential values.  If one is inserting
> keys that given the data model represent one collective, sequential unit, such
> as a full document exploded across multiple key-value pairs, it's advisable to
> use `addWriteConflictRange(firstKey, lastKey+1)` to collapse each individual
> write conflict range into one large write conflict range, and thus greatly
> reduce the size of the commit request that will be sent.
> So using this optimization, blindly inserting a document using the scheme
> proposed by the "exploded KV" approach, one would end up with a formula looking
> more like
> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written) $$
> With index-on-write, we now need to add inserting (prefix/value/key, "") for each
> index entry into the equation:
> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written) +
>   number of indexes * 3 * ( len(prefix) + len(one value) + len(one key) ) $$
> And there's no similar write conflict range optimization that can be done, as
> index writes will be non-sequential and spread widely across the keyspace.
> Index types that can cause a large number of entries to be written will have an
> even larger affect on total commit size.  I’ve been using the mental example of
> indexing all the elements of an array nested deep within a document.  However,
> if I’ve read CouchDB documentation correctly, all of the index types that can
> cause this explosion are grouped as part of Mango Search indexes, and thus out
> of scope for this proposal.  Mango JSON indexes seem to have a direct correlation
> that one (index,field) pair will turn into one Key-Value write to FDB. Convenient!
> Therefore, in order for a user to run into Mango JSON index-induced transaction
> size limits, they would have to manually add an astounding number of indexes.
> I've failed to quickly find an answer as to if there's an existing maximum
> number of Mango JSON indexes that can be added in a CouchDB instance, but if
> there is, this might not be a concern anyway.
> The worst case would be that each field of a document is indexed.  Assuming the
> "exploded KV" proposal as the representation, each key and value would be
> repeated 3 more times, and thus this would turn into
> $$ 2*len(keys read) + 2*len(one key) + 3*len(prefix) + 4*len(keys written) + 4*len(values
written) $$
> So if an maximum exploded KV size of 1MB is chosen, transaction commit requests
> would be expected to stay roughly under 4MB.
> However, if you're thinking into the future, and considering that whatever
> choice is taken for Mango JSON indexes should also be taken for Mango Search
> indexes as well, the above is definitely a concern.
> ### Indexing Write Penalty
> And, as we're now discussing what would happen with thousands of indexes
> defined, let's briefly discuss what impact that would have on actually
> committing that transaction.
> With most distributed databases, each index added equates to an additional
> shard of data involved in the transaction, and thus likely an additional server
> and RPC.  This would lead to an increase in latency and concurrency control
> costs, as more indexes are added.  This adds a strong cost to each additional
> index.  One much larger than the actual cost of the disk IO involved in doing
> the actual index write.
> FoundationDB does all commits against a distributed write-ahead-log, and
> asynchronously to the commit, applies the data to each shard of data involved in
> the transaction.  This means that a commit of 5 adjacent keys and 5 keys
> randomly distributed across the keyspace will be processed in the same way by
> the transaction subsystem.  One would actually see better throughput at
> saturation when writing to randomly distributed keys, as the work of applying
> those mutations gets sharded across multiple storage servers.
> Creating many indexes still means more disk IO, but is comparatively cheaper on
> FDB than other distributed database architectures.
>> On Apr 11, 2019, at 5:42 AM, Garren Smith <> wrote:
>> I was chatting to Adam yesterday and I want to explore the index-on-write
>> indexing for Mango a bit more. I know there has been a bit of a discussion
>> that we should only use a background process to build mango indexes but I
>> think that building indexes as documents are created/updated along combined
>> with background processing for existing documents will be easier to
>> implement. Especially in the beginning as we build the new fdb layer.
>> Below is the process for building a new index:
>> 1. When a user defines a new index on an existing database, save the index
>> definition and also save the sequence that the index was added at. The
>> index should also be marked that it is in a `building`  phase so it won’t
>> be used yet to service queries. (I’ll come back to this later)
>> 2. Any write requests after that must read the new index definition and
>> update the index. When updating the new index, the writers should assume
>> that previous versions of the document have already been indexed.
>> 3. At the same time a background process will start reading sections of the
>> changes feed and building the index, this background process will keep
>> processing the changes read until it reaches the sequence number that the
>> index was saved at. Once it reaches that point, the index is up to date and
>> will be marked as `active` and can be used to service queries.
>> 4. There are some subtle behaviour around step 3 that is worth mentioning.
>> The background process will have the 5 second transaction limit, so it will
>> process smaller parts of the changes feed. Which means that it won’t have
>> one consistent view of the changes feed throughout the index building
>> process. This will lead to a conflict situation. For example when the
>> background process transaction is adding a document to the index while at
>> the same time a write request has a transaction that is updating the same
>> document. There are two possible outcomes to this, if the background
>> process wins, the write request will get a conflict. At that point the
>> write request will try to process the document again, read the old values
>> for that document, remove them from the index and add the new values to the
>> index. If the write request wins, and the background process gets a
>> conflict, then the background process can try again, the document would
>> have been removed from its old position in the changes feed and moved to
>> the later position, so the background process won’t see the document and
>> will then move on to the next one.
>> 5. One other feature to add is to an index progress tracker. We can do this
>> by using doc_count for the database, and then have a counter value that the
>> background workers can increment with the number of documents it updated
>> for each batch update.  We would also have to update this counter on write
>> requests while the index is in building mode.
>> 6. Something we can also explore is splitting the building of the index
>> across multiple workers, we can use the `get_boundary_keys` [1] API call on
>> the changes feed to get the full list of the changes feed keys grouped by
>> partition boundaries and then split that by workers.
>> Adding a building and active state to the index’s is a bit of a breaking
>> change, but I think its the right way to go. Currently with Mango what can
>> happen is a user creates an index and then immediately does a query that
>> would use that index. Mango would then have to build the whole index before
>> responding to that request. In this new index-on-write process, Mango would
>> ignore the new index until it is active which I think is the better way to
>> go on this.
>> Finally, a big acknowledgment to Adam who is the major brains behind this
>> design.
>> What do you think, I would like to hear any thoughts, questions or
>> suggestions on this design.
>> Cheers
>> Garren
>> [1]
>>> On Mon, Apr 8, 2019 at 3:50 PM Garren Smith <> wrote:
>>>> On Tue, Apr 2, 2019 at 3:14 AM Adam Kocoloski <>
>>>> Hi Will, great comments, I have replies to a couple of them.
>>>>> On Apr 1, 2019, at 5:21 AM, Will Holley <>
>>>>> 2. Does the ICU sort key have a bounded length? Mostly I'm wondering
>>>>> whether we can guarantee that the generated keys will fit within the
>>>>> maximum FDB key length or if there needs to be some thought as to the
>>>>> failure mode / workaround. As Adam mentioned, it seems fine to store
>>>>> encoded key given Mango (currently) always fetches the associated
>>>> document
>>>>> / fields from the primary index to filter on anyway. It might even be
>>>>> beneficial to have an additional layer of indirection and allow multiple
>>>>> docs to be associated with each row so that we can maintain compact
>>>> keys.
>>>> Interesting thought on that layer of indirection; it reminds me of an
>>>> optimization applied in the Record Layer’s text indexes. Would have to
>>>> compare whether the extra reads needed to maintain the index that way are
>>>> an acceptable tradeoff.
>>>> Good point on the sort key sizes, I’ve not seen any way to place a
>>>> reliably safe upper bound on the size of one that might be generated. The
>>>> ICU folks have some hand-wavey guidance at
>>>> but it seems like we might be able to dig a little deeper.
>>>> I personally haven’t given much thought to a workaround where a
>>>> user-defined index key exceeds 10 KB. We’ll definitely need to handle that
>>>> failure mode safely even without the sort key complication — people try
>>>> crazy things :)
>>> For the 10 KB error, I think we should just return an error. As a
>>> comparison, MongoDB has a 1024 Byte limit
>>>>> 3. I don't immediately see how you clear previous values from the index
>>>>> when a doc is updated, but I could easily be missing something obvious
>>>> :)
>>>> Ah yeah, this part wasn’t explicit, was it?
>>>> I think the idea is that these are simple indexes on specific fields of a
>>>> document, and we have a data model where those fields are already stored
>>>> their own keys in FDB, so there’s no need (in the case of Mango) to
>>>> maintain a separate docid -> {viewid, [keys]} mapping like we do today
>>>> each view group. Rather, the flow would go something like
>>>> 1) Check which fields are supposed to be indexed
>>>> 2) Retrieve values for those fields in the ?DOCUMENTS space for the
>>>> parent revision
>>>> 3) Compare the parent values with the ones supplied in this transaction;
>>>> if any indexed values change, clear the old ones and insert the new ones
>>>> with some additional caveats around checking that the supplied edit is
>>>> actually going to be winning (and therefore indexed) version after the
>>>> commit succeeds.
>>>>> 4. Regarding "Index on write" behaviour, is there something in the
>>>> existing
>>>>> design (Mango overlaying mrview / lucene) that would prevent this? I
>>>>> see some benefit for certain workloads (and headaches for others) but
>>>>> don't see that it's necessarily coupled to the Mango design given
>>>>> background indexing of new/changed indexes needs to be supported anyway.
>>>> I’m not sure I understand your question. In my mind the reason “index
>>>> write" is more applicable for Mango JSON than for generalized views is
>>>> because in the view case batching is currently quite important to achieve
>>>> good throughput to the JS system. You’re of course correct that we need
>>>> be able to re-generate Mango JSON indexes in the background as well.
>>>> Adam

View raw message