couchdb-dev mailing list archives

From Adam Kocoloski <kocol...@apache.org>
Subject Re: Partition query endpoints in CouchDB 4.0
Date Wed, 13 May 2020 21:29:41 GMT
Great feedback.

Answering your tangential question: document data is not colocated with the built-in _all_docs
or _changes indexes in FDB. A request with include_docs=true will cause the CouchDB layer
to submit additional range requests to retrieve the document data for each row in the feed.
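
To sketch what that looks like against the FDB Python bindings (the subspace
layout below is purely illustrative, not our actual schema):

    import fdb
    fdb.api_version(620)
    db = fdb.open()

    # Illustrative layout only; CouchDB's real FDB schema differs.
    changes = fdb.Subspace(('changes',))   # seq -> doc_id
    docs    = fdb.Subspace(('docs',))      # doc_id -> body

    @fdb.transactional
    def changes_with_docs(tr, since=0):
        rows = []
        # A single range read walks the _changes index in seq order...
        for k, v in tr.get_range(changes.pack((since,)), changes.range().stop):
            doc_id = v.decode()
            # ...but include_docs=true costs an extra read per row, and
            # those reads land wherever the doc keys happen to live.
            body = tr[docs.pack((doc_id,))]
            rows.append((changes.unpack(k)[0], doc_id, bytes(body)))
        return rows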

The segmented distribution of a high-throughput _changes feed to multiple consumers is something
that’s been in the back of my mind. I think we’re all excited about the totally-ordered
feed returning with CouchDB 4.0, but if the event rate gets high enough it’s not going to
be possible to serve that feed or even maintain that underlying index.

Partitions could be one answer for that, if partitions had _changes feeds (which I wrote about
in another thread). But that doesn’t address the fact that the DB-wide _changes feed has
its own scalability limits — at some point, all that data landing on the same storage server
(because of range partitioning) is going to be a bottleneck. I think what I’d really want
is a way for the server to shard the feed under the hood (like your integer slicing key),
and then a way for clients to discover the N unique URLs for the shards of the feed. The sequence
numbers in each feed would still be globally unique and totally ordered for consumers who
would need that guarantee.
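
To make the discovery idea concrete, a consumer could look something like the
sketch below; the _changes_shards endpoint and its response shape are entirely
hypothetical:

    import json
    import threading
    import requests

    BASE = 'http://localhost:5984/mydb'

    # Hypothetical discovery call returning the N shard-feed URLs.
    shard_urls = requests.get(BASE + '/_changes_shards').json()['shards']

    def consume(url):
        params = {'feed': 'continuous', 'since': '0'}
        with requests.get(url, params=params, stream=True) as resp:
            for line in resp.iter_lines():
                if not line:
                    continue  # skip heartbeat newlines
                row = json.loads(line)
                # row['seq'] stays globally unique and totally ordered,
                # so consumers needing cross-shard order can merge on it.
                print(row['seq'], row['id'])

    # One worker per shard covers the whole database.
    for url in shard_urls:
        threading.Thread(target=consume, args=(url,)).start()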

I can see where the “no-longer-matches” would be useful if the number of feeds for the
database changed over time. Just off the top of my head, I think we'd need to compactly
convey the range of documents that no longer match (and possibly the new location where
one could find them); emitting an entry for every affected document won't work.
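
Purely as a strawman, a compact entry might describe a whole key range rather
than individual documents; every field name here is invented:

    {"type": "no_longer_matches",
     "start_key": "user:1000",
     "end_key": "user:1999",
     "moved_to": "shard_3_url"}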

Cheers, Adam

> On May 13, 2020, at 2:44 PM, San Sato <sansato@inator.biz> wrote:
> 
> I understand that range-based access to fdb-based views gets efficiently
> dispatched to the node(s) holding the key-range, placing zero load on nodes
> that don't hold it.  (tangential question: are the data expected to be
> co-located with the index, or vice versa? or would access patterns tend
> to spray across multiple servers as the documents themselves are indirected
> from the id-references in the index?  Very possibly I misunderstand the
> index-to-data storage architecture in f-couchdb; corrections/clarification
> gratefully welcomed)
> 
> Segmented distribution of reactive data-processing is a valuable use-case,
> whether using partitioned _changes or filtered _changes or some other way,
> so long as solution architects can count on efficiency and resulting
> scalability.  Reactive data-processing agents would then be able to request
> a specific feed covering a specific set of one or more
> partitions/shards/slices/similar, with a horde of such agents covering the
> full set of slices. It is not clear whether FDB-based couchdb would be able
> to provide the same assurance for that result as it does for range-based
> index access. I imagined using an integer slicing key and a sort of
> modulus/ring-hash scheme, or a similar mechanism yielding
> redundancy/failover plus scaling.
> 
> I would want the erl nodes serving a filtered _changes stream to be
> examining (and discarding) zero rows when the filter is index-based and no
> changes are being made to the rows matching the filter.  If that outcome
> only holds for a special partition-ish class of index, it would still seem
> valuable for segmenting the load, both at the functional level of "what
> data does a reactive agent see?" and at the level of "what computational
> cost does the couchdb server incur?"  It kind of sounds like the generalized
> version of that result may be just as feasible as the specialised case,
> given some possible constraints on index setup / change-feed patterns.
> 
> The gravy on this path would be for _changes feeds to emit
> "no-longer-matches" on data that formerly matched a filter (indexed-only?)
> but no longer does.  I know this wish wouldn't surprise anyone, and I
> understand there's probably a habit of thinking of that result as
> out-of-scope; I wanted to bring it up anyway as a practically valuable
> design consideration, in case it is more feasible with FDB.
> 
> <3 for couchdb  ❤️
> 
> Thank you to all.
> 
> On Mon, Apr 20, 2020 at 2:05 PM Robert Samuel Newson <rnewson@apache.org>
> wrote:
> 
>> Hi All,
>> 
>> I'd like to get views on whether we should preserve the _partition
>> endpoints in CouchDB 4.0 or remove them. In CouchDB 4.0 all _view and _find
>> queries will automatically benefit from the same performance boost that the
>> "partitioned database" feature brings, by virtue of FoundationDB.
>> 
>> If we're preserving it, are we also deprecating it (so it's gone in 5.0)?
>> 
>> If we're ditching it, what will the endpoint return instead (404 Not
>> Found, 410 Gone?)
>> 
>> Thoughts welcome.
>> 
>> B.

