From dev-return-48255-archive-asf-public=cust-asf.ponee.io@couchdb.apache.org  Fri Feb  8 06:45:28 2019
Return-Path: <dev-return-48255-archive-asf-public=cust-asf.ponee.io@couchdb.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by mx-eu-01.ponee.io (Postfix) with SMTP id A490C18060E
	for <archive-asf-public@cust-asf.ponee.io>; Fri,  8 Feb 2019 07:45:27 +0100 (CET)
Received: (qmail 26633 invoked by uid 500); 8 Feb 2019 06:45:26 -0000
Mailing-List: contact dev-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:dev-help@couchdb.apache.org>
List-Unsubscribe: <mailto:dev-unsubscribe@couchdb.apache.org>
List-Post: <mailto:dev@couchdb.apache.org>
List-Id: <dev.couchdb.apache.org>
Reply-To: dev@couchdb.apache.org
Delivered-To: mailing list dev@couchdb.apache.org
Received: (qmail 26604 invoked by uid 99); 8 Feb 2019 06:45:26 -0000
Received: from mail-relay.apache.org (HELO mailrelay1-lw-us.apache.org) (207.244.88.152)
    by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 08 Feb 2019 06:45:26 +0000
Received: from mail-it1-f172.google.com (mail-it1-f172.google.com [209.85.166.172])
	by mailrelay1-lw-us.apache.org (ASF Mail Server at mailrelay1-lw-us.apache.org) with ESMTPSA id 670AB114E
	for <dev@couchdb.apache.org>; Fri,  8 Feb 2019 06:45:25 +0000 (UTC)
Received: by mail-it1-f172.google.com with SMTP id b5so6479108iti.2
        for <dev@couchdb.apache.org>; Thu, 07 Feb 2019 22:45:25 -0800 (PST)
X-Gm-Message-State: AHQUAua4Smy2Wgkj+8ZVkUfwHgnWbK5+xLItwS81CpuUBP0VI3NHFS4c
	vfuFhhEkvVn00FcnT9EbDOh96cBmgP8AX2FmekvjtA==
X-Google-Smtp-Source: AHgI3IaMszzMcgkAmdoYUJ5WrLo0/dWCTpC/CbvsTWaoeG1Pmo2VjZugmaRTkTRqYnFu1ruzBQrUKlIzkya/4sz0NuU=
X-Received: by 2002:a5d:9a98:: with SMTP id c24mr11274786iom.227.1549608324812;
 Thu, 07 Feb 2019 22:45:24 -0800 (PST)
MIME-Version: 1.0
References: <CAOFTT0yYEEOf0gTHsACuhe1Ok+UAVeeHOTvsj-SPLSPB3TGDbw@mail.gmail.com>
 <E53B210C-20EC-4CB9-8B28-25CD78890A1D@apache.org> <pony-3fd039c72976f9f89629fb5b95d0a929c183add0-5e66abcd9165e03a3602983a15a47a9ef3588519@dev.couchdb.apache.org>
 <1549309264.3455235.1650702256.0515B088@webmail.messagingengine.com>
 <pony-3fd039c72976f9f89629fb5b95d0a929c183add0-129e99152c082ae65c49cebac47365a9b0cf2ac8@dev.couchdb.apache.org>
 <571B799C-6EFF-430B-BF49-F74FF0D02623@apache.org> <5DA78E37-319E-4BF7-9E39-CFDA7B1718E9@apache.org>
 <pony-3fd039c72976f9f89629fb5b95d0a929c183add0-4eb266cf4e6501072bfa4ed216645c2197421510@dev.couchdb.apache.org>
 <1549576374.3293989.1653251968.446DBE10@webmail.messagingengine.com> <CE4D5AC8-B837-4622-8192-ABCA9B606E8F@apache.org>
In-Reply-To: <CE4D5AC8-B837-4622-8192-ABCA9B606E8F@apache.org>
From: Garren Smith <garren@apache.org>
Date: Fri, 8 Feb 2019 08:45:13 +0200
X-Gmail-Original-Message-ID: <CAOFTT0xadq4hFsWQ2aNv6JSCjCEGr+M4pfyA_dGG5OgxgVHMbQ@mail.gmail.com>
Message-ID: <CAOFTT0xadq4hFsWQ2aNv6JSCjCEGr+M4pfyA_dGG5OgxgVHMbQ@mail.gmail.com>
Subject: Re: # [DISCUSS] : things we need to solve/decide : storage of edit conflicts
To: dev@couchdb.apache.org
Content-Type: multipart/alternative; boundary="000000000000bde5d605815c4d87"

--000000000000bde5d605815c4d87
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Adam,

Thanks for the detailed email. In terms of the data model, that makes a lot
of sense.

I’m still playing a bit of catchup on understanding how fdb works, so I
can’t comment on the best way to retrieve a document.

From my side, I would like to see our decisions also driven by testing and
validating that our data model works. I find the way that fdb was tested
and built really impressive. I would love to see us apply some of that to
the way we build our CouchDB layer.

Cheers
Garren

On Fri, Feb 8, 2019 at 5:35 AM Adam Kocoloski <kocolosk@apache.org> wrote:

> Bob, Garren, Jan - heard you loud and clear, K.I.S.S. I do think it’s a
> bit “simplistic” to exclusively choose simplicity over performance and
> storage density. We’re (re)building a database here, one that has some
> users with pretty demanding performance and scalability requirements. And
> yes, we should certainly be testing and measuring. Kyle and team are
> setting up infrastructure in IBM land to help with that now, but I also
> believe we can design the data model and architecture with a basic
> performance model of FoundationDB in mind:
>
> - reads cost 1ms
> - short range reads are the same cost as a single lookup
> - reads of independent parts of the keyspace can be parallelized for cheap
> - writes are zero-cost until commit time
>
> We ought to be able to use these assumptions to drive some decisions about
> data models ahead of any end-to-end performance test.
>
> If there are specific elements of the edit conflicts management where you
> think greater simplicity is warranted, let’s get those called out. Ilya
> noted (correctly, in my opinion) that the term sharing stuff is one of
> those items. It’s relatively complex, potentially a performance hit, and
> only saves on storage density in the corner case of lots of edit conflicts.
> That’s a good one to drop.
>
> I’m relatively happy with the revision history data model at this point.
> Hopefully folks find it easy to grok, and it’s efficient for both reads and
> writes. It costs some extra storage for conflict revisions compared to the
> current tree representation (up to 16K per edit branch, with default
> _revs_limit) but knowing what we know about the performance death spiral
> for wide revision trees today I’ll happily make a storage vs. performance
> tradeoff here :)
>
> Setting the shared term approach aside, I’ve still been mulling over the
> key structure for the actual document data:
>
> -  I thought about trying to construct a special _conflicts subspace, but
> I don’t like that approach because the choice of a “winning” revision can
> flip back and forth very quickly with concurrent writers to different edit
> branches. I think we really want to have a way for revisions to naturally
> sort themselves so the winner is the first or last revision in a list.
>
> - Assuming we’re using key paths of the form (docid, revision-ish, path,
> to, field), the goal here is to find an efficient way to get the last key
> with prefix “docid” (assuming winner sorts last), and then all the keys
> that share the same (docid, revision-ish) prefix as that one. I see two
> possible approaches so far, neither perfect:
>
> Option 1: Execute a get_key() operation with a key selector that asks for
> the last key less than “docid\xFF” (again assuming winner sorts last), and
> then do a get_range_startswith() request setting the streaming mode to
> “want_all” and the prefix to the docid plus whatever revision-ish we found
> from the get_key() request. This is two roundtrips instead of one, but it
> always retrieves exactly the right set of keys, and the second step is
> executed as fast as possible.
>
> Option 2: Jump straight to get_range_startswith() request using only
> “docid” as the prefix, then cancel the iteration once we reach a revision
> not equal to the first one we see. We might transfer too much data, or we
> might end up doing multiple roundtrips if the default “iterator” streaming
> mode sends too little data to start (I haven’t checked what the default
> iteration block is there), but in the typical case of zero edit conflicts
> we have a good chance of retrieving the full document in one roundtrip.
>
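To make the two options concrete, here is a minimal Python sketch that runs
both against an in-memory sorted map instead of FoundationDB. Everything in it
is an assumption for illustration: SortedKV, last_less_than and
range_startswith are hypothetical stand-ins for the get_key() key selector and
get_range_startswith() calls described above, keys are Python tuples of the
form (docid, revision-ish, field), and Option 2 scans in reverse because the
winner is assumed to sort last.

```python
import bisect


class SortedKV:
    """Toy ordered key-value store; a stand-in for an FDB transaction."""

    def __init__(self, items):
        self.data = dict(items)
        self.keys = sorted(self.data)

    def last_less_than(self, key):
        # Mimics a "last key less than" key selector.
        i = bisect.bisect_left(self.keys, key)
        return self.keys[i - 1] if i > 0 else None

    def range_startswith(self, prefix, reverse=False):
        # Mimics a prefix range read; yields (key, value) pairs in key order.
        lo = bisect.bisect_left(self.keys, prefix)
        hi = lo
        while hi < len(self.keys) and self.keys[hi][:len(prefix)] == prefix:
            hi += 1
        span = self.keys[lo:hi]
        for k in (reversed(span) if reverse else span):
            yield k, self.data[k]


def read_winner_option1(kv, docid):
    """Two steps: find the last key for docid, then read that revision only."""
    # docid + "\x00" is the smallest string greater than docid, so every key
    # belonging to docid sorts strictly below this bound.
    last = kv.last_less_than((docid + "\x00",))
    if last is None or last[0] != docid:
        return {}
    winner = last[1]
    return {k[2:]: v for k, v in kv.range_startswith((docid, winner))}


def read_winner_option2(kv, docid):
    """One scan over the docid prefix, stopped when the revision changes."""
    doc, winner = {}, None
    for k, v in kv.range_startswith((docid,), reverse=True):
        if winner is None:
            winner = k[1]          # first revision seen is the winner
        elif k[1] != winner:
            break                  # reached a losing revision, stop reading
        doc[k[2:]] = v
    return doc
```

Both functions return the same fields for the same data; what differs is how
many requests are issued and how much data a real range read might transfer
before it is cancelled.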
> I don’t have a good sense of which option wins out here from a performance
> perspective, but they’re both operating on the same data model so easy
> enough to test the alternatives. The important bit is getting the
> revision-ish things to sort correctly. I think we can do that by generating
> something like
>
> revision-ish = NotDeleted/1bit : RevPos : RevHash
>
> with some suitable order-preserving encoding on the RevPos integer.
>
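One possible reading of that layout, sketched in Python. The details are
assumptions rather than anything decided in this thread: a whole byte stands in
for the 1-bit NotDeleted flag, RevPos is packed as an 8-byte big-endian
unsigned integer so that byte order matches numeric order, and RevHash is the
raw bytes of the hex digest. encode_revision_ish and pick_winner are
hypothetical helper names.

```python
import struct


def encode_revision_ish(deleted, rev_pos, rev_hash_hex):
    """Encode a revision so that plain byte comparison puts the winner last."""
    # Live revisions get 0x01 and deleted ones 0x00, so live sorts after deleted.
    not_deleted = b"\x00" if deleted else b"\x01"
    # Big-endian fixed width keeps integer order under bytewise comparison.
    pos = struct.pack(">Q", rev_pos)
    # Raw hash bytes: larger binary values sort later.
    return not_deleted + pos + bytes.fromhex(rev_hash_hex)


def pick_winner(revisions):
    """revisions: iterable of (deleted, rev_pos, rev_hash_hex) tuples."""
    return max(revisions, key=lambda r: encode_revision_ish(*r))
```

The fixed-width integer is the important part. A decimal string would not
work, since "10" sorts before "2" under byte comparison.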
> Apologies for the long email. Happy for any comments, either here or over
> on IRC. Cheers,
>
> Adam
>
> > On Feb 7, 2019, at 4:52 PM, Robert Newson <rnewson@apache.org> wrote:
> >
> > I think we should choose simple. We can then see if performance is too
> low or storage overhead too high and then see what we can do about it.
> >
> > B.
> >
> > --
> >  Robert Samuel Newson
> >  rnewson@apache.org
> >
> > On Thu, 7 Feb 2019, at 20:36, Ilya Khlopotov wrote:
> >> We cannot do the simple thing if we want to support sharing of JSON terms. I
> >> think if we want the simplest path we should move sharing out of the
> >> scope. The problem with sharing is we need to know the location of
> >> shared terms when we do a write. This means that we have to read the full
> >> document on every write. There might be tricks to replace the full document
> >> read with some sort of hierarchical signature or sketch of a document.
> >> However these tricks do not fall into simplest solution category. We
> >> need to choose the design goals:
> >> - simple
> >> - performance
> >> - reduced storage overhead
> >>
> >> best regards,
> >> iilyak
> >>
> >> On 2019/02/07 12:45:34, Garren Smith <garren@apache.org> wrote:
> >>> I’m also in favor of keeping it really simple and then testing and
> >>> measuring it.
> >>>
> >>> What is the best way to measure that we have something that works? I’m not
> >>> sure just relying on our current tests will prove that. Should we define
> >>> and build some more complex situations, e.g. docs with lots of conflicts or
> >>> docs with wide revisions, and make sure we can solve for those?
> >>>
> >>> On Thu, Feb 7, 2019 at 12:33 PM Jan Lehnardt <jan@apache.org> wrote:
> >>>
> >>>> I’m also very much in favour of starting with the simplest thing that
> >>>> can possibly work and doesn’t go against the advertised best practices of
> >>>> FoundationDB. Let’s get that going and get a feel for how it all works
> >>>> together, before trying to optimise things we can’t measure yet.
> >>>>
> >>>> Best
> >>>> Jan
> >>>> =E2=80=94
> >>>>
> >>>>> On 6. Feb 2019, at 16:58, Robert Samuel Newson <rnewson@apache.org>
> >>>> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> With the Redwood storage engine under development and with prefix
> >>>> elision part of its design, I don’t think we should get too hung up on
> >>>> adding complications and indirections in the key space just yet. We haven’t
> >>>> written a line of code or run any tests, this is premature optimisation.
> >>>>>
> >>>>> I’d like to focus on the simplest solution that yields all required
> >>>> properties. We can embellish later (if warranted).
> >>>>>
> >>>>> I am intrigued by all the ideas that might allow us cheaper inserts
> and
> >>>> updates than the current code where there are multiple edit branches
> in the
> >>>> stored document.
> >>>>>
> >>>>> B.
> >>>>>
> >>>>>> On 6 Feb 2019, at 02:18, Ilya Khlopotov <iilyak@apache.org> wrote:
> >>>>>>
> >>>>>> While reading Adam's proposal I came to realize that: we don't hav=
e
> to
> >>>> calculate winning revision at read time.
> >>>>>> Since FDB's transactions are atomic we can calculate it when we
> write.
> >>>> This means we can just write latest values into separate range. This
> makes
> >>>> lookup of latest version fast.
> >>>>>> Another realization is if we want to share values for some json
> paths
> >>>> we would have to introduce a level of indirection.
> >>>>>> Below is the data model inspired by Adam's idea to share
> json_paths.
> >>>> In this model the json_path is stored in the revision where it was
> first
> >>>> added (we call that revision an owner of a json_path). The values fo=
r
> >>>> json_path key can be scalar values, parts of scalar values or
> pointers to
> >>>> owner location.
> >>>>>> The below snippets are sketches of transactions.
> >>>>>> The transactions will include updates to other keys as needed
> >>>> (`external_size`, `by_seq` and so on).  The revision tree management
> is not
> >>>> covered yet.
> >>>>>> The `rev -> vsn` indirection is not strictly required. It is added
> >>>> because it saves some space since `rev` is a long string and `vsn` i=
s
> FDB
> >>>> versionstamp of fixed size.
> >>>>>>
> >>>>>> - `{NS} / {docid} / _by_rev / {rev} =3D vsn`
> >>>>>> - `{NS} / {docid} / _used_by / {json_path} / {another_vsn} =3D NIL=
`
> >>>>>> - `{NS} / {docid} / _data / {json_path} =3D latest_value | part`
> >>>>>> - `{NS} / {docid} / {vsn} / _data / {json_path} =3D value | part |
> >>>> {another_vsn}`
> >>>>>>
> >>>>>> ```
> >>>>>> write(txn, doc_id, prev_rev, json):
> >>>>>> txn.add_write_conflict_key("{NS} / {doc_id} / _rev")
> >>>>>> rev =3D generate_new_rev()
> >>>>>> txn["{NS} / {docid} / _by_rev / {rev}"] =3D vsn
> >>>>>> for every json_path in flattened json
> >>>>>>   - {NS} / {docid} / _used_by / {json_path} / {another_vsn} =3D NI=
L
> >>>>>>   if rev is HEAD:
> >>>>>>     # this range contains values for all json paths for the latest
> >>>> revision (read optimization)
> >>>>>>     - {NS} / {docid} / _data / {json_path} =3D latest_value | part
> >>>>>>   - {NS} / {docid} / {vsn} / _data / {json_path} =3D value | part =
|
> >>>> {another_vsn}
> >>>>>> txn["{NS} / {doc_id} / _rev"] =3D rev
> >>>>>>
> >>>>>> get_current(txn, doc_id):
> >>>>>> # there is no sharing of json_paths in this range (read
> optimization)
> >>>>>> txn.get_range("{NS} / {docid} / _data / 0x00", "{NS} / {docid} /
> _data
> >>>> / 0xFF" )
> >>>>>>
> >>>>>> get_revision(txn, doc_id, rev):
> >>>>>> vsn = txn["{NS} / {docid} / _by_rev / {rev}"]
> >>>>>> json_paths = txn.get_range("{NS} / {docid} / {vsn} / _data / 0x00",
> >>>> "{NS} / {docid} / {vsn} / _data / 0xFF" )
> >>>>>> for every json_path in json_paths:
> >>>>>>  if value has type vsn:
> >>>>>>     another_vsn =3D value
> >>>>>>        value =3D txn["{NS} / {docid} / {another_vsn} / _data /
> >>>> {json_path}"]
> >>>>>>  result[json_path] =3D value
> >>>>>>
> >>>>>> delete_revision(txn, doc_id, rev):
> >>>>>> vsn = txn["{NS} / {docid} / _by_rev / {rev}"]
> >>>>>> json_paths = txn.get_range("{NS} / {docid} / {vsn} / _data / 0x00",
> >>>> "{NS} / {docid} / {vsn} / _data / 0xFF" )
> >>>>>> for every json_path in json_paths:
> >>>>>>  if value has type vsn:
> >>>>>>    # remove reference to deleted revision from the owner
> >>>>>>     del txn[{NS} / {docid} / _used_by / {json_path} / {vsn}]
> >>>>>>  # check if deleted revision of json_path is not used by anything
> else
> >>>>>>  if txn.get_range("{NS} / {docid} / _used_by / {json_path} / {vsn}=
",
> >>>> limit=3D1) =3D=3D []:
> >>>>>>     del txn["{NS} / {docid} / {vsn} / _data / {json_path}"]
> >>>>>>  if vsn is HEAD:
> >>>>>>     copy range for winning revision into "{NS} / {docid} / _data /
> >>>> {json_path}"
> >>>>>> ```
> >>>>>>
> >>>>>> best regards,
> >>>>>> iilyak
> >>>>>>
> >>>>>> On 2019/02/04 23:22:09, Adam Kocoloski <kocolosk@apache.org> wrote=
:
> >>>>>>> I think it=E2=80=99s fine to start a focused discussion here as i=
t might
> help
> >>>> inform some of the broader debate over in that thread.
> >>>>>>>
> >>>>>>> As a reminder, today CouchDB writes the entire body of each
> document
> >>>> revision on disk as a separate blob. Edit conflicts that have common
> fields
> >>>> between them do not share any storage on disk. The revision tree is
> encoded
> >>>> into a compact format and a copy of it is stored directly in both th=
e
> by_id
> >>>> tree and the by_seq tree. Each leaf entry in the revision tree
> contains a
> >>>> pointer to the position of the associated doc revision on disk.
> >>>>>>>
> >>>>>>> As a further reminder, CouchDB 2.x clusters can generate edit
> conflict
> >>>> revisions just from multiple clients concurrently updating the same
> >>>> document in a single cluster. This won=E2=80=99t happen when Foundat=
ionDB is
> >>>> running under the hood, but users who deploy multiple CouchDB or
> PouchDB
> >>>> servers and replicate between them can of course still produce
> conflicts
> >>>> just like they could in CouchDB 1.x, so we need a solution.
> >>>>>>>
> >>>>>>> Let=E2=80=99s consider the two sub-topics separately: 1) storage =
of edit
> >>>> conflict bodies and 2) revision trees
> >>>>>>>
> >>>>>>> ## Edit Conflict Storage
> >>>>>>>
> >>>>>>> The simplest possible solution would be to store each document
> >>>> revision separately, like we do today. We could store document bodie=
s
> with
> >>>> (=E2=80=9Cdocid=E2=80=9D, =E2=80=9Crevid=E2=80=9D) as the key prefix=
, and each transaction could
> clear the
> >>>> key range associated with the base revision against which the edit i=
s
> being
> >>>> attempted. This would work, but I think we can try to be a bit more
> clever
> >>>> and save on storage space given that we=E2=80=99re splitting JSON do=
cuments
> into
> >>>> multiple KV pairs.
> >>>>>>>
> >>>>>>> One thought I’d had is to introduce a special enum Value which
> >>>> indicates that the subtree “beneath” the given Key is in conflict. For
> >>>> example, consider the documents
> >>>>>>>
> >>>>>>> {
> >>>>>>>  "_id": "foo",
> >>>>>>>  "_rev": "1-abc",
> >>>>>>>  "owner": "alice",
> >>>>>>>  "active": true
> >>>>>>> }
> >>>>>>>
> >>>>>>> and
> >>>>>>>
> >>>>>>> {
> >>>>>>>  "_id": "foo",
> >>>>>>>  "_rev": "1-def",
> >>>>>>>  "owner": "bob",
> >>>>>>>  "active": true
> >>>>>>> }
> >>>>>>>
> >>>>>>> We could represent these using the following set of KVs:
> >>>>>>>
> >>>>>>> ("foo", "active") = true
> >>>>>>> ("foo", "owner") = kCONFLICT
> >>>>>>> ("foo", "owner", "1-abc") = "alice"
> >>>>>>> ("foo", "owner", "1-def") = "bob"
> >>>>>>>
> >>>>>>> This approach also extends to conflicts where the two versions have
> >>>> different data types. Consider a more complicated example where bob dropped
> >>>> the “active” field and changed the “owner” field to an object:
> >>>>>>>
> >>>>>>> {
> >>>>>>> "_id": "foo",
> >>>>>>> "_rev": "1-def",
> >>>>>>> "owner": {
> >>>>>>>  "name": "bob",
> >>>>>>>  "email": "bob@example.com"
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>> Now the set of KVs for “foo” looks like this (note that a missing
> >>>> field needs to be handled explicitly):
> >>>>>>>
> >>>>>>> ("foo", "active") = kCONFLICT
> >>>>>>> ("foo", "active", "1-abc") = true
> >>>>>>> ("foo", "active", "1-def") = kMISSING
> >>>>>>> ("foo", "owner") = kCONFLICT
> >>>>>>> ("foo", "owner", "1-abc") = "alice"
> >>>>>>> ("foo", "owner", "1-def", "name") = "bob"
> >>>>>>> ("foo", "owner", "1-def", "email") = "bob@example.com"
> >>>>>>>
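A rough Python sketch of how conflicting revisions could be flattened into KVs
of this shape. It is only an illustration under stated assumptions: the
function names and the KCONFLICT / KMISSING string sentinels are hypothetical,
arrays are treated as opaque scalars, and recursing into sub-objects that both
revisions share is my extrapolation from the two examples above.

```python
KCONFLICT = "kCONFLICT"   # placeholder for the special enum value
KMISSING = "kMISSING"     # placeholder marking a field a revision lacks
_ABSENT = object()        # internal marker, never written out


def flatten(prefix, value):
    """Yield (key_tuple, scalar) pairs for one JSON value under prefix."""
    if isinstance(value, dict):
        for field in sorted(value):
            yield from flatten(prefix + (field,), value[field])
    else:
        yield prefix, value


def merge(prefix, by_rev):
    """by_rev maps a revision id to that revision's value at this path."""
    values = list(by_rev.values())
    if all(v is not _ABSENT and v == values[0] for v in values):
        # Every revision agrees: store the subtree once, with no rev in the key.
        yield from flatten(prefix, values[0])
    elif all(isinstance(v, dict) for v in values):
        # All objects but not identical: descend and share whatever agrees.
        for field in sorted(set().union(*values)):
            children = {rev: v.get(field, _ABSENT) for rev, v in by_rev.items()}
            yield from merge(prefix + (field,), children)
    else:
        # Scalar mismatch, type mismatch, or a missing field: mark the
        # conflict, then store each revision's value beneath its rev id.
        yield prefix, KCONFLICT
        for rev, v in sorted(by_rev.items()):
            if v is _ABSENT:
                yield prefix + (rev,), KMISSING
            else:
                yield from flatten(prefix + (rev,), v)


def encode_conflicts(docs):
    """docs: revisions of one document, each carrying "_id" and "_rev"."""
    docid = docs[0]["_id"]
    by_rev = {d["_rev"]: {k: v for k, v in d.items() if not k.startswith("_")}
              for d in docs}
    return dict(merge((docid,), by_rev))
```

Calling encode_conflicts() with the alice revision and either of the bob
revisions above yields the same key sets as the two KV listings.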
> >>>>>>> I like this approach for the common case where documents share
> most of
> >>>> their data in common but have a conflict in a very specific field or
> set of
> >>>> fields.
> >>>>>>>
> >>>>>>> I=E2=80=99ve encountered one important downside, though: an edit =
that
> >>>> replicates in and conflicts with the entire document can cause a bit
> of a
> >>>> data explosion. Consider a case where I have 10 conflicting versions
> of a
> >>>> 100KB document, but the conflicts are all related to a single scalar
> value.
> >>>> Now I replicate in an empty document, and suddenly I have a kCONFLIC=
T
> at
> >>>> the root. In this model I now need to list out every path of every
> one of
> >>>> the 10 existing revisions and I end up with a 1MB update. Yuck. That=
=E2=80=99s
> >>>> technically no worse in the end state than the =E2=80=9Czero sharing=
=E2=80=9D case
> above,
> >>>> but one could easily imagine overrunning the transaction size limit
> this
> >>>> way.
> >>>>>>>
> >>>>>>> I suspect there=E2=80=99s a smart path out of this. Maybe the sys=
tem
> detects a
> >>>> =E2=80=9Cdefault=E2=80=9D value for each field and uses that instead=
 of writing out
> the
> >>>> value for every revision in a conflicted subtree. Worth some
> discussion.
> >>>>>>>
> >>>>>>> ## Revision Trees
> >>>>>>>
> >>>>>>> In CouchDB we currently represent revisions as a hash history tre=
e;
> >>>> each revision identifier is derived from the content of the revision
> >>>> including the revision identifier of its parent. Individual edit
> branches
> >>>> are bounded in *length* (I believe the default is 1000 entries), but
> the
> >>>> number of edit branches is technically unbounded.
> >>>>>>>
> >>>>>>> The size limits in FoundationDB preclude us from storing the enti=
re
> >>>> key tree as a single value; in pathological situations the tree coul=
d
> >>>> exceed 100KB. Rather, I think it would make sense to store each edit
> >>>> *branch* as a separate KV. We stem the branch long before it hits th=
e
> value
> >>>> size limit, and in the happy case of no edit conflicts this means we
> store
> >>>> the edit history metadata in a single KV. It also means that we can
> apply
> >>>> an interactive edit without retrieving the entire conflicted revisio=
n
> tree;
> >>>> we need only retrieve and modify the single branch against which the
> edit
> >>>> is being applied. The downside is that we duplicate historical
> revision
> >>>> identifiers shared by multiple edit branches, but I think this is a
> >>>> worthwhile tradeoff.
> >>>>>>>
> >>>>>>> I would furthermore try to structure the keys so that it is
> possible
> >>>> to retrieve the “winning” revision in a single limit=1 range query.
> Ideally
> >>>> I’d like to provide the following properties:
> >>>>>>>
> >>>>>>> 1) a document read does not need to retrieve the revision tree at
> all,
> >>>> just the winning revision identifier (which would be stored with the
> rest
> >>>> of the doc)
> >>>>>>> 2) a document update only needs to read the edit branch of the
> >>>> revision tree against which the update is being applied, and it can
> read
> >>>> that branch immediately knowing only the content of the edit that is
> being
> >>>> attempted (i.e., it does not need to read the current version of the
> >>>> document itself).
> >>>>>>>
> >>>>>>> So, I’d propose a separate subspace (maybe “_meta”?) for the
> revision
> >>>> trees, with keys and values that look like
> >>>>>>>
> >>>>>>> ("_meta", DocID, IsDeleted, RevPosition, RevHash) = [ParentRev,
> >>>> GrandparentRev, …]
> >>>>>>>
> >>>>>>> The inclusion of IsDeleted, RevPosition and RevHash in the key
> should
> >>>> be sufficient (with the right encoding) to create a range query that
> >>>> automatically selects the “winner” according to CouchDB’s arcane
> rules,
> >>>> which are something like
> >>>>>>>
> >>>>>>> 1) deleted=false beats deleted=true
> >>>>>>> 2) longer paths (i.e. higher RevPosition) beat shorter ones
> >>>>>>> 3) RevHashes with larger binary values beat ones with smaller
> values
> >>>>>>>
> >>>>>>> ===========
> >>>>>>>
> >>>>>>> OK, that=E2=80=99s all on this topic from me for now. I think thi=
s is a
> >>>> particularly exciting area where we start to see the dividends of
> splitting
> >>>> up data into multiple KV pairs in FoundationDB :) Cheers,
> >>>>>>>
> >>>>>>> Adam
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Feb 4, 2019, at 2:41 PM, Robert Newson <rnewson@apache.org>
> wrote:
> >>>>>>>>
> >>>>>>>> This one is quite tightly coupled to the other thread on data
> model,
> >>>> should we start much conversation here before that one gets closer t=
o
> a
> >>>> solution?
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Robert Samuel Newson
> >>>>>>>> rnewson@apache.org
> >>>>>>>>
> >>>>>>>> On Mon, 4 Feb 2019, at 19:25, Ilya Khlopotov wrote:
> >>>>>>>>> This is a beginning of a discussion thread about storage of edi=
t
> >>>>>>>>> conflicts and everything which relates to revisions.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> Professional Support for Apache CouchDB:
> >>>> https://neighbourhood.ie/couchdb-support/
> >>>>
> >>>>
> >>>
>
>

--000000000000bde5d605815c4d87--