couchdb-dev mailing list archives

From: Jan Lehnardt <...@apache.org>
Subject: Re: CouchDB Next
Date: Thu, 29 Sep 2016 07:55:02 GMT

> On 29 Sep 2016, at 08:42, Reddy B. <reddy.b@live.fr> wrote:
> 
> Jan and Paul,
> 
> Thanks for all this insight, this is awesome.
> 
> > couch-plugins
> 
> This is it! Looks very exciting, I'll absolutely start looking into the source code this week.

Fantastic, thank you, I’d love to see this revived.

The strong consistency thing that Nick mentioned might play into this, as we’d want that
for a cluster-wide configuration system, and plugins could probably very nicely make use of
that.

Best
Jan
--


> 
> 
> On 28/09/2016 21:49, Paul Davis wrote:
>> I definitely screwed up on the mrview/index split in hindsight.
>> couch_index should have been a library app that indexers could use as
>> a toolbox rather than a weirdo plugin/callback system like it is now.
>> Live and learn!
>> 
>> For the extensibility aspect, once we get a solid abstraction like
>> we've been talking about for the HTTP API, it seems like this sort of
>> thing would be a lot easier inasmuch as it'd be standard Erlang
>> procedure for reusing applications. And we could look into having
>> releases of components available on hex.pm.
>> 
>> On Wed, Sep 28, 2016 at 1:26 PM, Robert Samuel Newson
>> <rnewson@apache.org> wrote:
>>> Hi,
>>> 
>>> We can certainly do better on this front. I will say that the (now venerable) couchdb-lucene project had no problem extending CouchDB with full-text search capability, without source modifications.
>>> 
>>> In 2.0, it's true that we've made things harder to plug into. The couch_epi application is our general answer here: it allows programmatic overrides in various places, and we can expand on those hook points easily enough. It does mean writing Erlang code, though.
>>> 
>>> When we talk about switching from mochiweb to cowboy, we gain another possibility
to allow extensions through cowboy middleware.
>>> 
>>> To truly make couchdb extensible/pluggable to the degree you seem to be asking
for would be more work than that, I think. Under the covers, of course, couchdb is already
composed of a large set of independent processes that communicate with each other using messages.
>>> 
>>> The couch_index/couch_mrview split from years back was specifically to allow
for new index types to be added smoothly (geocouch was the motivating case, in fact). It's
fair to say that it did not pan out, but other approaches could.
>>> 
>>> I think it best not to raise the specter of COM (or CORBA); the details of those distract from the intention here. What you seek is a more composable approach, where you could assemble a system of CouchDB components and custom components?
>>> 
>>> It might help at this point to hear some more examples of the extensions you
didn't feel able to make.
>>> 
>>> B.
>>> 
>>>> On 28 Sep 2016, at 13:04, Reddy B. <reddy.b@live.fr> wrote:
>>>> 
>>>> I've been very busy with work for just one month, and when I catch up, 2.0 is out and you're already talking about 3.0. Congratulations!
>>>> 
>>>> I'd like to contribute to this list. I've not read the source code of CouchDB yet, so I can't be too precise, but as the head of development at several companies, I thought my proposal could be valuable.
>>>> 
>>>> The one big regret I have with CouchDB is the difficulty of extending it, namely the necessity to rebuild CouchDB from source to add things such as Lucene, or even GeoCouch. To take our example, we would have contributed a number of extensions to CouchDB already if it weren't for that. Perhaps it's just me, but there really is a psychological threshold to pass to get into building a third-party project, and another one to get into forking it. I personally don't know if I'll ever get past them, because there's too much cost and maintenance involved.
>>>> 
>>>> I'm not sure exactly what the limitation is or whether this is achievable, but some sort of language-agnostic plugin architecture/extensibility pipeline would be absolutely great, and in my opinion could be an interesting priority for a version 3.0, as it would dramatically help boost the number of contributions to the CouchDB ecosystem. I'm not sure I have the terminology right, but it might all come down to making the creation of custom indexes rebuild-free and language-agnostic. I'm thinking of something along the lines of COM APIs <https://msdn.microsoft.com/en-us/library/windows/desktop/ms680573%28v=vs.85%29.aspx>.
>>>> 
>>>> If you find the idea interesting, I'd be happy to start getting my hands
dirty and work on it.
>>>> 
>>>> 
>>>> On 27/09/2016 14:56, Jan Lehnardt wrote:
>>>>> Hi all,
>>>>> 
>>>>> apologies in advance, this is going to be a long email.
>>>>> 
>>>>> 
>>>>> I’ve been holding this back intentionally in order to be able to focus
on shipping 2.0, but now that that’s out, I feel we should talk about what’s next.
>>>>> 
>>>>> This email is separated into areas of work that I think CouchDB could
improve on, some with very concrete plans, some with rather vague ideas. I’ve been collecting
these over the past year or <strike>two</strike>five, so it’s fairly wide, but
I’m sure I’m missing things that other people find important, so please add to this list.
>>>>> 
>>>>> After the initial discussion here, I’ll move all of the individual
issues to JIRA, so we can go down our usual process.
>>>>> 
>>>>> This is basically my wish list, and I’d like this to become everyone’s wish list, so please add what I’ve been missing. :) — Note, this isn’t a free-for-all; only suggest things that you are prepared to see through to being shipped, from design and implementation to docs.
>>>>> 
>>>>> I don’t have a specific order for these in mind, although I have a
rough idea of what we should be doing first. Putting all of this on a roadmap is going to
be a fun future exercise for us, though :)
>>>>> 
>>>>> One last note: this doesn’t include anything on documentation or testing. I fully expect us to step up our game there from here on out. This list is for the technical aspects of the project.
>>>>> 
>>>>> * * *
>>>>> 
>>>>> These are the areas of work I’ve roughly come up with that my suggestions
fit into:
>>>>> 
>>>>> - API
>>>>> - Storage
>>>>> - Query
>>>>> - Replication
>>>>> - Cluster
>>>>> - Fauxton
>>>>> - Releases
>>>>> - Performance
>>>>> - Internals
>>>>> - Builds
>>>>> - Features
>>>>> 
>>>>> (I’m not claiming these are any good, but it’s what I’ve got)
>>>>> 
>>>>> 
>>>>> Let’s go.
>>>>> 
>>>>> 
>>>>> * * *
>>>>> 
>>>>> # API
>>>>> 
>>>>> ## HTTP2
>>>>> 
>>>>> I think this is an obvious first next step. Our HTTP layer needs work; our existing HTTP server library is not getting HTTP2 support, so it’s time to attack this head-first. I’m imagining a Cowboy[1]-based HTTP layer that calls into a unified internals layer, and everything will be rose-golden. HTTP2 support for Cowboy is still in progress. Maybe we can help them along, or we focus on the internals refactor first and drop Cowboy in later (not sure how feasible this approach is, but we’ll figure it out).
>>>>> 
>>>>> In my head, we focus on this and call the result 3.0 in 6-12 months.
That doesn’t mean we *only* do this, but this will be the focus (more on this later).
>>>>> 
>>>>> There are a few fun considerations, mainly of the “avoid Python 2/3-chasm”-type.
Do we re-implement the 2.0 API with all its idiosyncrasies, or do we take the opportunity
to clean things up while we are at it? If yes, how and how long do we support the then old
API? Do we manage this via different ports? If yes, how can this be made to work for hosting services like Cloudant? Etc. etc.
>>>>> 
>>>>> [1] https://github.com/ninenines/cowboy
>>>>> 
>>>>> 
>>>>> ## Sub-Document Operations
>>>>> 
>>>>> Currently a doc update needs the whole doc body sent to the server. There
are some obvious performance improvements possible. For the longest time, I wanted to see
if we can model sub-document operations via JSON Pointers[2]. These would roughly allow pointing
to a JSON value via a URL.
>>>>> 
>>>>> For example in this doc:
>>>>> 
>>>>> {
>>>>>   "_id": "123abc",
>>>>>   "_rev": "zyx987",
>>>>>   "contact": {
>>>>>     "name": "",
>>>>>     "address": {
>>>>>       "street": "Long Street",
>>>>>       "nr": 123,
>>>>>       "zip": "12345"
>>>>>     }
>>>>>   }
>>>>> }
>>>>> 
>>>>> An update to the zip code could look like this:
>>>>> 
>>>>> curl -X POST $SERVER/db/123abc/_jsonpointer/contact/address/zip?rev=zyx987
-d '54321'
>>>>> 
>>>>> GET/DELETE accordingly. We could shortcut the `_jsonpointer` to just
`_` if we like the short magic.
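>>>>>
>>>>> For illustration, GET and DELETE against the same (equally hypothetical) endpoint:
>>>>>
>>>>> # read just the zip code
>>>>> curl -X GET $SERVER/db/123abc/_jsonpointer/contact/address/zip
>>>>>
>>>>> # remove it (a write, so it needs the current rev)
>>>>> curl -X DELETE $SERVER/db/123abc/_jsonpointer/contact/address/zip?rev=zyx987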
>>>>> 
>>>>> JSONPointer can deal with nested objects and lists and works fairly well
for this type of stuff, and it is rather simple to implement (even I could do it: https://github.com/janl/erl-jsonpointer/blob/master/src/jsonpointer.erl
— This idea is literally 5 years old, it looks like, no need to use my code if there is
anything better).
>>>>> 
>>>>> This is just a raw idea, and I’m happy to solve this any other way,
if somebody has a good approach.
>>>>> 
>>>>> [2] https://tools.ietf.org/html/rfc6901
>>>>> 
>>>>> 
>>>>> ## HTTP PATCH / JSON Diff
>>>>> 
>>>>> Another stab at a similar problem is HTTP PATCH with JSON Diff, but given the inherent problems of JSON normalisation, I’m leaning towards the JSONPointer variant as simpler. I’d be open to this as well, though, if someone comes up with a good approach.
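>>>>>
>>>>> For a flavour, a PATCH using RFC 6902 JSON Patch semantics could look like this (a sketch only, no such endpoint exists):
>>>>>
>>>>> curl -X PATCH $SERVER/db/123abc?rev=zyx987 \
>>>>>   -H "Content-Type: application/json-patch+json" \
>>>>>   -d '[{"op": "replace", "path": "/contact/address/zip", "value": "54321"}]'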
>>>>> 
>>>>> 
>>>>> ## GraphQL[3]
>>>>> 
>>>>> It’s rather new, but getting good traction[4]. This would be a nice
addition to our API. Somebody might already be hacking on this ;)
>>>>> 
>>>>> [3]: http://graphql.org
>>>>> [4]: http://githubengineering.com/the-github-graphql-api/
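>>>>>
>>>>> Just to make it concrete, a query against a made-up _graphql endpoint could look like:
>>>>>
>>>>> curl -X POST $SERVER/db/_graphql \
>>>>>   -H "Content-Type: application/json" \
>>>>>   -d '{"query": "{ doc(id: \"123abc\") { contact { address { zip } } } }"}'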
>>>>> 
>>>>> 
>>>>> ## Mango for Document Validation
>>>>> 
>>>>> The only place where we absolutely require writing JS is validate_doc_update
functions. Some security behaviour can only be enforced there. With their inherent performance
problems, I’d like to get doc validations out of the path of the query server and would
love to find a way to validate document updates through Mango.
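>>>>>
>>>>> I don’t have a design yet, but naively, a declarative validation rule in a design doc could look something like this (entirely made-up syntax):
>>>>>
>>>>> {
>>>>>   "_id": "_design/validation",
>>>>>   "validate": {
>>>>>     "selector": {"type": {"$in": ["post", "comment"]}, "author": {"$exists": true}},
>>>>>     "on_mismatch": "forbidden"
>>>>>   }
>>>>> }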
>>>>> 
>>>>> 
>>>>> ## Redesign Security System
>>>>> 
>>>>> Our security system has grown slowly and was not coherently designed. We should start over. I have many ideas and opinions, but they are out of scope for this. I think everybody here agrees that we can do better. This *very likely* will *not* include per-document ACLs, as per the often-stated issues with that approach in our data model.
>>>>> 
>>>>> * * *
>>>>> 
>>>>> 
>>>>> # Replication
>>>>> 
>>>>> This is our flagship feature of course, and there are a few things we
can do better.
>>>>> 
>>>>> 
>>>>> ## Mobile-optimised extension or new version of the protocol
>>>>> 
>>>>> The original protocol design didn’t take mobile devices into account
and through PouchDB et al. we are now learning that there are a number of downsides to our protocol.
We’ve helped a lot with introducing _bulk_get/_revs, but that’s more a bandaid than a
considered strategy ;)
>>>>> 
>>>>> That new version could also be HTTP2-only, to take advantage of the new
connection semantics there.
>>>>> 
>>>>> 
>>>>> ## Easy way to skip deletes on sync
>>>>> 
>>>>> This one is self-explanatory, mobile clients usually don’t need to
sync deletes from a year ago first. Mango filters might already get us there, maybe we can
do better.
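>>>>>
>>>>> E.g. a replication document with a Mango selector along these lines might already do the trick (untested sketch):
>>>>>
>>>>> {
>>>>>   "source": "https://example.com/mail",
>>>>>   "target": "mail-local",
>>>>>   "selector": {"_deleted": {"$exists": false}}
>>>>> }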
>>>>> 
>>>>> 
>>>>> ## Sync a rolling subset
>>>>> 
>>>>> Say you always want to keep the last 90 days of email on a mobile device, with optional back-loading of older documents on user request. This is something I could see getting a lot of traction.
>>>>> 
>>>>> Today, this can be built on 1.x with clever use of _purge, but that’s
hardly a good experience. I don’t know if it can be done in a cluster.
>>>>> 
>>>>> 
>>>>> ## Selective Sync
>>>>> 
>>>>> There might be other criteria than “last 90 days”, so the more general solution to this problem class would be arbitrary (e.g. client-directed) selective sync. That might be really hard, though, as opposed to the merely very hard “last 90 days” case, so I’m happy to punt on it at first. But filters are generally not the answer, especially with large data sets. Maybe proper sync from a view’s _changes is the answer.
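>>>>>
>>>>> For very small sets, a client can already direct replication per document today, though that doesn’t scale:
>>>>>
>>>>> curl -X POST $SERVER/_replicate \
>>>>>   -H "Content-Type: application/json" \
>>>>>   -d '{"source": "mail", "target": "mail-local", "doc_ids": ["inbox-001", "inbox-002"]}'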
>>>>> 
>>>>> 
>>>>> ## A _db_updates powered _replicator DB
>>>>> 
>>>>> Running thousands of replications on a server is not really resource-friendly today; we should teach the replicator to only run replications against active databases, via _db_updates. Somebody might already be looking into this one.
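>>>>>
>>>>> The _db_updates feed already tells us which databases see writes; roughly:
>>>>>
>>>>> curl $SERVER/_db_updates?feed=continuous
>>>>> {"db_name": "mailbox", "type": "updated"}
>>>>> {"db_name": "invoices", "type": "created"}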
>>>>> 
>>>>> * * *
>>>>> 
>>>>> 
>>>>> # Storage
>>>>> 
>>>>> 
>>>>> ## Pluggable Storage Engines
>>>>> 
>>>>> Paul Davis already showed some work on allowing multiple different storage
backends. I’d like to see this land.
>>>>> 
>>>>> ## Different Storage Backends
>>>>> 
>>>>> These don’t all have to be supported by the main project, but I’d
really like to see some experimentation with different backends like LevelDB[5]/RocksDB[6],
InnoDB[7], SQLite[8], or a native-Erlang one that is optimised for space usage and not performance
(I don’t want to budge on safety). Similarly, it’d be fun to see if there is a compression
format that we can use as a storage backend directly, so we get full-DB compression as opposed
to just per-doc compression.
>>>>> 
>>>>> [5]: http://leveldb.org
>>>>> [6]: http://rocksdb.org
>>>>> [7]: https://en.wikipedia.org/wiki/InnoDB
>>>>> [8]: https://www.sqlite.org
>>>>> 
>>>>> * * *
>>>>> 
>>>>> 
>>>>> # Query
>>>>> 
>>>>> ## Teach Mango JOINs and result sorting
>>>>> 
>>>>> It’s the natural path for query languages. We should make these happen.
Once we have the basics, we might even be able to find a way to compile basic SQL into Mango,
it’s going to be glorious :)
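>>>>>
>>>>> Hypothetically, extending today’s _find request shape, it could look like this (the “join” part is invented):
>>>>>
>>>>> curl -X POST $SERVER/db/_find -H "Content-Type: application/json" -d '{
>>>>>   "selector": {"type": "post"},
>>>>>   "sort": [{"date": "desc"}],
>>>>>   "join": {"with": "comments", "on": {"post_id": "_id"}}
>>>>> }'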
>>>>> 
>>>>> 
>>>>> ## “No-JavaScript”-mode
>>>>> 
>>>>> I’ve hinted at this above, but I’d really like a way for users to
use CouchDB productively without having to write a line of JavaScript. My main motivation
is the poor performance characteristics of the Query Server (hello CGI[9]?). But even with
one that is improved, it will always be faster to do, say, filtering or validation operations
in native Erlang. I don’t know if we can expand Mango to cover all this, and I’m not really
concerned about the specifics, as long as we get there.
>>>>> 
>>>>> Of course, for pro-users, the JS-variant will still be around.
>>>>> 
>>>>> [9]: https://en.wikipedia.org/wiki/Common_Gateway_Interface
>>>>> 
>>>>> 
>>>>> ## Query Server V2
>>>>> 
>>>>> We need to revamp the Query Server. It is hardcoded to an out-of-date
version of SpiderMonkey and we are stuck with C-bindings that barely anyone dares to look
at, let alone iterate on.
>>>>> 
>>>>> I believe the way forward is revamping the query server protocol to use streaming IO instead of blocking batches like we do now, and using a JS-native implementation on the JS side instead of C-bindings.
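>>>>>
>>>>> For context, today’s protocol is line-based JSON over stdio, in blocking request/response batches, roughly:
>>>>>
>>>>> ["reset"]                                            -> true
>>>>> ["add_fun", "function (doc) { emit(doc._id, 1); }"]  -> true
>>>>> ["map_doc", {"_id": "123abc"}]                       -> [[["123abc", 1]]]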
>>>>> 
>>>>> I’m partial to doing this straight in Node, because there is a ton
of support for things we need already, and I believe we’ve solved the isolation issues required
for secure MapReduce, but I’m happy to use any other thing as well, if it helps.
>>>>> 
>>>>> Other benefits would be support for emerging JS features that devs will
want to use.
>>>>> 
>>>>> And we can have two modes: standalone QS like now, and embedded QS where,
say, V8 is compiled into the Erlang VM. Not everybody will want to run this, but it’ll be
neat for those who do.
>>>>> 
>>>>> 
>>>>> * * *
>>>>> 
>>>>> 
>>>>> # Cluster
>>>>> 
>>>>> ## Rebalancing
>>>>> 
>>>>> With this we will be able to grow clusters one node at a time instead of hitting a wall when eventually each shard lives on a single machine. E.g. when you add a node to the cluster, all other nodes share 1/Nth of their data with the new node, and everything keeps going. Same for removing a node and shrinking the cluster.
>>>>> 
>>>>> Couchbase has this and it is really nice.
>>>>> 
>>>>> 
>>>>> ## Setup
>>>>> 
>>>>> Even without rebalancing, we need a nice Fauxton UI to manage the cluster. So far we only have a simple setup procedure (which is great, don’t get me wrong), but users will want to do more elaborate cluster management, and we should make that easy with a slick UI.
>>>>> 
>>>>> 
>>>>> ## Cluster-Aware Clients
>>>>> 
>>>>> This might turn out not to be a good idea, but I’d like some experimentation here. Say you had a CouchDB client that could be hooked into the cluster topology so it’d know which nodes to query for which data; then we could save a proxy hop and build clients that have lower-latency access to CouchDB. Again, this is something that Couchbase does and I think it is worth exploring.
>>>>> 
>>>>> 
>>>>> 
>>>>> * * *
>>>>> 
>>>>> 
>>>>> # Fauxton
>>>>> 
>>>>> Fauxton is great, but it could be better too, I think. I’m mostly concerned about the number of clicks/taps required for more specialised actions (like setting the group_level of a reduce query, which takes 15 or so). More cluster info would also be nice, and maybe a specialised dashboard for db-per-user setups.
>>>>> 
>>>>> 
>>>>> * * *
>>>>> 
>>>>> 
>>>>> # Releases
>>>>> 
>>>>> 
>>>>> ## Six-Week Release Trains
>>>>> 
>>>>> We need to get back to frequent releases and I propose to go back to
our six-week release-train plans from three years ago. Whatever lands within a release-train
time frame goes out. The nature of the change dictates the version number increment as per
semver, and we just ship a new version every six weeks, even if it only includes a single
bug fix. We should automate most of this infrastructure, so actual releases are cheap. We
are reasonably close with this, but we need some more folks to step up on using and maintaining
our CI systems.
>>>>> 
>>>>> 
>>>>> ## One major feature per major version
>>>>> 
>>>>> I also propose to keep the scope of future major versions small, so we
don’t have to wait another 3-5 years for 3.0. In particular, I think we should focus on
a single major feature per major version and get that shipped within 6-12 months tops. If
anything needs more time, it needs to be broken up. Of course we continue to add features
and fix things while this happens, but as a project, there is *one* major feature we push.
For example, for 3.0 I see our push being behind HTTP2 support. There is a lot of subsequent
work required to make that happen, so it’ll be a worthwhile 3.0, but we can ship it in 6-12
months (hopefully).
>>>>> 
>>>>> Best case scenario, we have CouchDB 4.0 coming out 12 months from now
with two new major features. That would be amazing.
>>>>> 
>>>>> 
>>>>> * * *
>>>>> 
>>>>> 
>>>>> # Performance
>>>>> 
>>>>> ## Perf Team
>>>>> 
>>>>> We need a team to comprehensively look at CouchDB performance. There is a lot of low-hanging fruit, as Robert Kowalski showed a while back; we should get back into this. I’m mostly inspired by SQLite, who did a release a while back that focussed only on 1-2% performance improvements, but landed 20-30 of those and made the thing a lot faster across the board. I can’t remember where I read about this, but I’ll update this once I find the link.
>>>>> 
>>>>> 
>>>>> ## Benchmark Suite
>>>>> 
>>>>> We need a benchmark suite that tests a variety of different work loads.
The goal here is to run different versions of CouchDB against the same suite on the same hardware,
to see where we are going. I’m imagining an http://arewefastyet.com-style dashboard where we
can track this, and even run this on Pull Requests and not allow them if they significantly
impact performance.
>>>>> 
>>>>> 
>>>>> ## Synthetic Load Suite
>>>>> 
>>>>> This one is for end users. I’d like to be able to say: My app produces
mostly 10-20kb-sized docs, but millions of those in a single database, or across 1000s of
databases, with these views etc. and then run this on target hardware so I’d know, e.g.
how many nodes I need for a cluster with my estimated workload. I know this can only be done
in approximation, but I think this could make a big difference in CouchDB adoption and feed back into the Perf Team mentioned above.
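>>>>>
>>>>> A workload description could hypothetically look like this (made-up format):
>>>>>
>>>>> {
>>>>>   "doc_size_kb": [10, 20],
>>>>>   "docs_per_db": 1000000,
>>>>>   "databases": 1000,
>>>>>   "views_per_db": 4,
>>>>>   "read_write_ratio": 0.8
>>>>> }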
>>>>> 
>>>>> * * *
>>>>> 
>>>>> 
>>>>> # Internals
>>>>> 
>>>>> ## Consolidate Repositories
>>>>> 
>>>>> With 2.0 we started to experiment with radically small modules for our components, and I think we’ve come to the conclusion that some consolidation is better for us going forward. Obvious candidates for separate repos are docs, Fauxton, etc., but also some of the Erlang modules that other projects could reasonably use.
>>>>> 
>>>>> 
>>>>> ## Elixir
>>>>> 
>>>>> I’d like it very much if we elevate Elixir as a prime target language
for writing CouchDB internals. I believe this would get us an influx of new developers that
we badly need to get all the things I’m listing here done. Somebody might be looking into
the technical aspects of this already, but we need to decide as a project if we are okay with
that.
>>>>> 
>>>>> 
>>>>> ## GitHub Issues
>>>>> 
>>>>> I hope we can transition to GitHub Issues soon.
>>>>> 
>>>>> * * *
>>>>> 
>>>>> 
>>>>> # Builds
>>>>> 
>>>>> I’d like automated builds for source, Docker et al., rpm, deb, brew, ports, Mac binary, etc., with proper release channels for people to subscribe to, all powered by CI for nightly builds, so people can test in-development versions easily.
>>>>> 
>>>>> I’d also like builds that include popular community plugins like Geo
or Fulltext Search.
>>>>> 
>>>>> 
>>>>> 
>>>>> * * *
>>>>> 
>>>>> 
>>>>> # Features
>>>>> 
>>>>> ## Better Support for db-per-user
>>>>> 
>>>>> I don’t know what this will look like, but this is a pattern, and we
need to support it better.
>>>>> 
>>>>> One approach could be “virtual dbs” that are backed by a single database,
but that’s usually at odds with views, so we could make this an XOR and disable views on
these dbs. Since this usually powers client-heavy apps, querying usually happens there anyway.
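>>>>>
>>>>> To sketch the virtual-db idea (all of this is made up): clients would see per-user database names, while storage is one physical database with prefixed doc ids:
>>>>>
>>>>> # what the client sees (hypothetical routing)
>>>>> curl $SERVER/userdb-alice/doc1
>>>>> # what actually gets stored
>>>>> curl $SERVER/all-users/alice%2Fdoc1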
>>>>> 
>>>>> Another approach would be better / easier cross-db aggregation or querying.
There are a few approaches, but nothing really slick.
>>>>> 
>>>>> 
>>>>> ## Schema Extraction
>>>>> 
>>>>> I have half an (old) patch that extracts top-level fields from a document and stores them with a hash in an “attachment” to the database header, so we only end up storing doc values and the schema hash. First of all this trades storage for CPU time (I haven’t measured anything yet), but more interestingly, we could use that schema data to do smart things like auto-generating a validation function / Mango expression based on the data that is already in the database. And other fun things like easier schema migration operations that are native to CouchDB and thus a lot faster than external ones. For the curious, I got the idea from V8’s property access optimisation strategy[10].
>>>>> 
>>>>> [10]: https://github.com/v8/v8/wiki/Design%20Elements#fast-property-access
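>>>>>
>>>>> As an illustration (invented format): the keys collapse into one shared schema entry, and each doc stores only its values plus the schema hash:
>>>>>
>>>>> schema "f9a1": ["_id", "_rev", "contact.name", "contact.address.zip"]
>>>>> doc:           {"schema": "f9a1", "values": ["123abc", "zyx987", "", "12345"]}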
>>>>> 
>>>>> * * *
>>>>> 
>>>>> Alright, that’s it for now. Can’t wait for your feedback!
>>>>> 
>>>>> Best
>>>>> Jan
> 

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/

